Parsing Amazon with Hpricot
Posted by hardwarehank, Thu Jul 06 01:07:34 UTC 2006
_why made a really sweet HTML parser called Hpricot. Combined with Open-URI, it lets you fetch and parse a remote document in a few lines. Here’s how to pull the attributes of the product image out of an Amazon page:
require 'rubygems'
require_gem 'hpricot'
require 'open-uri'

puts "Grabbing Page..."
html = open("http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155")
puts "Parsing..."
doc = Hpricot.parse(html)
(doc.search("//table//td[@id='prodImageCell']")/:img).each do |link|
  p link.attributes
end
That prints the attributes of the product image:
{"src"=>"http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg", "border"=>"0", "id"=>"prodImage", "height"=>"240", "alt"=>"Cobblers", "width"=>"240"}
Or, as a one-liner from the shell:
ruby -rrubygems -ropen-uri -e "require 'hpricot';(Hpricot.parse(open('http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155')).search(\"//table//td[@id='prodImageCell']\")/:img).each {|link| p link.attributes }"
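If you want to see what that search expression picks out without hitting Amazon, here is a rough offline sketch. It does not use Hpricot at all: it swaps in REXML and runs the equivalent XPath, `//table//td[@id='prodImageCell']//img`, against a tiny hand-written fragment. The fragment, the `SAMPLE_PAGE` constant, and the `image_attributes` helper are all made up for illustration, and REXML needs well-formed markup, which real Amazon pages are not.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the product page (not real Amazon markup).
SAMPLE_PAGE = <<~XML
  <html>
    <body>
      <table>
        <tr>
          <td id="prodImageCell">
            <img id="prodImage" src="cover.jpg" alt="Cobblers" width="240" height="240" border="0"/>
          </td>
          <td id="otherCell"><img src="spacer.gif" alt=""/></td>
        </tr>
      </table>
    </body>
  </html>
XML

# Returns one plain Hash of attributes per <img> found inside the
# td whose id is 'prodImageCell'. Images in other cells are ignored.
def image_attributes(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//table//td[@id='prodImageCell']//img").map do |img|
    attrs = {}
    img.attributes.each { |name, value| attrs[name] = value }
    attrs
  end
end

image_attributes(SAMPLE_PAGE).each { |attrs| p attrs }
```

Only the image inside the `prodImageCell` cell comes back; the spacer image in the other cell is skipped, which is the same filtering the Hpricot search does on the real page.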
Amazing stuff, really. The parser is remarkably fast: nearly all of the run time goes to fetching the page, not parsing it!
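If you would rather measure the fetch/parse split than take my word for it, something like this sketch should do. It uses Benchmark from the standard library; the `time_phases` helper and its keyword names are my own invention, and it takes the two steps as callables so you can plug in whatever fetcher and parser you have.

```ruby
require 'benchmark'

# Times the fetch and parse steps separately.
# `fetch` takes no arguments and returns the raw HTML;
# `parse` takes that HTML and returns the parsed document.
# (Hypothetical helper, not part of Hpricot or Open-URI.)
def time_phases(fetch:, parse:)
  html = nil
  doc = nil
  fetch_seconds = Benchmark.realtime { html = fetch.call }
  parse_seconds = Benchmark.realtime { doc = parse.call(html) }
  { fetch: fetch_seconds, parse: parse_seconds, doc: doc }
end

# Possible use with the script above (assumes hpricot and open-uri are loaded
# and `url` holds the page address):
#   t = time_phases(fetch: -> { open(url).read },
#                   parse: ->(html) { Hpricot.parse(html) })
#   puts "fetch: #{t[:fetch]}s, parse: #{t[:parse]}s"
```

Reading the response body inside the fetch step matters here, otherwise part of the I/O could end up being counted as parse time.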
Also, “Sunset, Sunrise” by Razor Ramon is awesome.

Blog Posts
August 13, 2007 @ 05:22 PM
Yes, Hpricot is great. I’ve tried it for a while locally and would like to use it on my web apps, but it’s hard to set up on Dreamhost as the gem is not installed there. Any clues?
August 13, 2007 @ 09:47 PM
You can set up a gem directory in your home directory.
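One way that could look from inside a Ruby script, as a rough sketch: point RubyGems at a directory under your home directory before requiring the gem. The `.gems` directory name and the `use_home_gem_dir` helper are assumptions for illustration, not anything Dreamhost prescribes, and you would still need to install hpricot into that directory first (for example with `gem install` while `GEM_HOME` points there).

```ruby
require 'rubygems'
require 'fileutils'

# Makes `dir` the gem home for this process and keeps the system gem
# directories on the search path. Returns the directory it configured.
# (Hypothetical helper; the default ~/.gems location is an assumption.)
def use_home_gem_dir(dir = File.join(Dir.home, '.gems'))
  FileUtils.mkdir_p(dir)
  ENV['GEM_HOME'] = dir
  ENV['GEM_PATH'] = [dir, *Gem.default_path].join(File::PATH_SEPARATOR)
  Gem.clear_paths # make RubyGems re-read GEM_HOME and GEM_PATH
  dir
end

# Possible use at the top of a script:
#   use_home_gem_dir
#   require 'hpricot'
```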