Crawling the web with Ruby

Recently, for one of my projects, I needed to write a custom crawler to crawl some of my favorite blogs. I am a recent Ruby convert and I am totally in love with the language, so I decided to write the crawler in Ruby. After some Googling around, I found a very neat crawler, Spider, written in Ruby by Mike Burns. You can find the source code here. The code is currently maintained by John Nagro.

The crawler is simple, yet it supports Amazon SQS and memcached, so you can run multiple crawler processes at the same time. There are also some drawbacks:

1. The URLs are simply crawled in the order they are discovered (breadth-first search). There is no way to specify the order in which the URLs should be visited.

2. There is no HTML parsing on top of the crawled data.
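
To make the first drawback concrete, here is a minimal sketch of what crawling "in the order they are discovered" means. It does not use Spider at all: the LINKS hash is a made-up, in-memory stand-in for the links found on each page, and crawl_order is a hypothetical helper written only for this illustration.

```ruby
# Hypothetical link graph (not real pages): page => links found on that page.
LINKS = {
  'home'   => ['about', 'blog'],
  'about'  => ['team'],
  'blog'   => ['post-1', 'home'],
  'team'   => [],
  'post-1' => []
}

# Visit pages first in, first out, which gives breadth-first order.
def crawl_order(start, links)
  queue = [start]           # pages discovered but not yet visited
  seen  = { start => true } # pages already queued, to avoid revisiting
  order = []

  until queue.empty?
    page = queue.shift      # take the oldest discovered page
    order << page
    links.fetch(page, []).each do |link|
      next if seen[link]
      seen[link] = true
      queue << link         # newly discovered pages go to the back
    end
  end

  order
end

puts crawl_order('home', LINKS).join(', ')
```

Every page linked from the start page is visited before anything deeper, and there is no hook for saying "visit this one first". Getting a different order would mean swapping the plain queue for something like a priority queue.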

The first drawback was not much of a problem for my purposes, because I needed to crawl all the pages of a blog, so the crawl order did not matter to me. The crawler lets you specify, as a regex, which URLs should be crawled. That was all I needed.
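
As a rough sketch of that kind of regex check, here is a standalone version that can be tried without the crawler. The ALLOWED constant and the crawlable? helper are names I made up for this illustration; they are not part of Spider. The pattern escapes the dot and anchors the end of the host name so that look-alike hosts are rejected.

```ruby
# Hypothetical pattern for illustration: only URLs on this one host.
# \A anchors the start, \. matches a literal dot, and (/|\z) requires
# the host name to end right after "com".
ALLOWED = %r{\Ahttp://simpliflying\.com(/|\z)}

# Returns true when the URL should be crawled, false otherwise.
def crawlable?(url)
  !!(url =~ ALLOWED)
end

[
  'http://simpliflying.com/',
  'http://simpliflying.com/2009/some-post',
  'http://simpliflying.com.example.org/',
  'http://example.com/'
].each do |url|
  puts "#{url}: #{crawlable?(url) ? 'crawl' : 'skip'}"
end
```

The same pattern could be used inside the block given to the crawler's URL check, which only needs to return a truthy or falsy value for each URL.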

I did need HTML parsing, though, and that was easy: I used the Hpricot gem to parse the crawled pages.

Here is some sample code showing how to crawl a particular blog with Spider and use Hpricot to print the content attribute of each meta tag on every page:

    require 'rubygems'
    require 'spider'
    require 'hpricot'

    Spider.start_at('http://simpliflying.com/') do |s|
      # Limit the pages to just this domain.
      # The dot is escaped so that it only matches a literal ".".
      s.add_url_check do |a_url|
        a_url =~ %r{^http://simpliflying\.com.*}
      end

      # Handle 404s.
      s.on 404 do |a_url, err_code|
        puts "URL not found: #{a_url}"
      end

      # Handle 2xx: parse the page and print each meta tag's content.
      s.on :success do |a_url, resp, prior_url|
        puts "#{a_url}: #{resp.code}"
        doc = Hpricot(resp.body)
        (doc/"meta").each do |meta|
          puts meta.attributes['content']
        end
      end

      # Handle everything.
      s.on :any do |a_url, resp|
        puts "URL returned anything: #{a_url} with this code #{resp.code}"
      end
    end

Very good Hpricot documentation is available here.
