The robots.txt and sitemap.xml files are extremely useful for managing web crawlers, but they can also be dangerous when they give away too much information. So what are these files, and how are they used?
Many years ago, web site owners and web crawler operators developed a method that allows site owners to politely ask crawlers not to crawl certain portions of a site. This is done by defining a robots.txt file (Robots Exclusion Standard). For the most part, web crawlers adhere to this standard.
A typical robots.txt file looks like this.
```
#Google Search Engine Robot
#Yahoo! Search Engine Robot
```
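Python's standard library can evaluate these rules the same way a well-behaved crawler does. The sketch below uses `urllib.robotparser` against a small illustrative rule set (the directives here are made up for the example, not Twitter's actual file) to decide whether a given user agent may fetch a URL.

```python
# Decide whether a user agent may fetch a URL under a robots.txt policy,
# using Python's standard-library Robots Exclusion Standard parser.
from urllib.robotparser import RobotFileParser

# Illustrative rules, not any real site's robots.txt.
robots_txt = """\
User-agent: Googlebot
Disallow: /search
Allow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot is blocked from /search; every other bot is blocked from /private/.
print(rp.can_fetch("Googlebot", "https://example.com/search"))
print(rp.can_fetch("SomeBot", "https://example.com/private/page"))
print(rp.can_fetch("SomeBot", "https://example.com/public"))
```

Remember that this is a convention, not an access control: the parser only tells a crawler what it has been asked to avoid.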
This is a portion of Twitter's robots.txt file. Notice how Twitter tells the search engines which portions of the site may and may not be crawled.
Tonight, while surfing the web, I found this robots.txt file. I'll let you guess which site it goes to.
Notice that there are no Disallow directives. Based on the accepted convention, the web site owner gives all web robots permission to crawl the entire site. Theoretically, you could write your own robot and legally crawl the entire site.
Along with the agreement to use robots.txt files, web site owners and crawler operators also agreed to use a sitemap.xml file to explicitly define the structure of the web site and the URLs "on a website that are available for crawling" (Sitemaps). The Sitemap directive can be added to the robots.txt file to tell web crawlers where to find the sitemap file.
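On Python 3.8 and later, the same standard-library parser can also pull out these Sitemap directives. A minimal sketch, using hypothetical example.com URLs:

```python
# Extract the Sitemap directives declared in a robots.txt file
# (RobotFileParser.site_maps() requires Python 3.8+).
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.site_maps())  # the sitemap URLs declared in the file
```

This is exactly how a crawler discovers a site's sitemap without being told about it in advance.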
If we look at the sitemap for data.healthcare.gov, we can see the URLs which, by convention, we are EXPECTED to crawl or visit as users.
This sitemap file tells us about two additional sitemap files. The sitemap-users-data.healthcare.gov0.xml file looks interesting.
This sitemap file gives us the profile link for a number of user accounts. In fact, it provides links to approximately 3,900 user accounts. Again, based on convention, robots are EXPECTED to visit each of these links and download the page behind it. You can see that Google did exactly this by running this query.
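Pulling every URL out of a sitemap file takes only a few lines of Python. The XML below is a tiny stand-in for the real sitemap-users-data.healthcare.gov0.xml file, with made-up profile identifiers; the parsing pattern is the same at any size.

```python
# Extract every <loc> URL from a sitemap, the way a crawler would.
import xml.etree.ElementTree as ET

# A small stand-in for the real ~3,900-entry user sitemap; IDs are invented.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://data.healthcare.gov/profile/AAAA-1111</loc></url>
  <url><loc>https://data.healthcare.gov/profile/BBBB-2222</loc></url>
</urlset>"""

# Sitemap files live in the sitemaps.org namespace, so findall needs it.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(len(urls), "profile URLs found")
```

Against the real file, `urls` would be the full list of ~3,900 profile links, which is exactly what makes this sitemap so interesting.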
So, let's download a page. When you visit a page in a browser, the page is downloaded and rendered by the browser. When a robot or search engine downloads the page, it reads and parses the HTML code that makes up the page. To see this HTML code, you can right-click on the page in your browser and choose View Source, or something similar depending on your browser. The source code looks like this.
Note that the source code is full of links to other portions of the data.healthcare.gov site, which, by convention, we are allowed to crawl because the robots.txt file does not define any disallowed portions of the site. One such link, /api/users/ta6q-868x?method=contact, is found about 1/3 of the way down the page. Visiting this page produces an error message in JSON format, which means a web crawler like Google will likely ignore this page.
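Harvesting those links from the source is the core of what a crawler does. A minimal sketch with Python's standard-library HTML parser, fed a tiny stand-in for the real page (the /api/users link is from the actual source; the rest is invented for the example):

```python
# Collect every href from a page's HTML, as a crawler does when it
# parses the downloaded source.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Accumulates the href value of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A tiny stand-in for the real profile page's source.
html_source = """<html><body>
<a href="/api/users/ta6q-868x?method=contact">Contact</a>
<a href="/browse">Browse</a>
</body></html>"""

collector = LinkCollector()
collector.feed(html_source)
print(collector.links)
```

Each collected link becomes a candidate for the crawler's next fetch, which is how one permissive robots.txt file opens up an entire site.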
{
  "code" : "method_not_allowed",
  "error" : true,
  "message" : "GET not allowed"
}
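A crawler (or your own script) can recognize this kind of API error by parsing the JSON body rather than treating it as an indexable page. A quick sketch using the response above:

```python
# Parse the API error body and decide whether the page is worth indexing.
import json

body = '{"code": "method_not_allowed", "error": true, "message": "GET not allowed"}'
response = json.loads(body)

if response.get("error"):
    # An error payload, not content -- a crawler would skip it.
    print("skipping:", response["message"])
```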
On a more serious note, a typical attack against websites involves enumerating user accounts and then attempting to brute-force the associated passwords. Typically, an attacker has to work to find a method to enumerate user accounts, but in this case the sitemap file provides a ready-made list of them. Personally, I think it would be wise to remove the sitemap file at