Parsing Amazon with Hpricot

Posted by hardwarehank, Thu Jul 06 01:07:34 UTC 2006

_why made a really sweet HTML parser called Hpricot. Paired with Open-URI, it lets you easily fetch and parse a remote document. Here’s how to do it:

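require 'rubygems'
require 'hpricot'   # plain require; require_gem is deprecated in newer RubyGems
require 'open-uri'
puts "Grabbing Page..."
# With Open-URI loaded, open() accepts a URL and returns an IO-like object
html = open("http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155")
puts "Parsing..."
doc = Hpricot.parse(html)
# Find the product image cell, then every <img> inside it
(doc.search("//table//td[@id='prodImageCell']")/:img).each do |img|
  p img.attributes
end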

This prints:

{"src"=>"http://ec1.images-amazon.com/images/P/1844300439.01._AA240_SCLZZZZZZZ_V54614147_.jpg", "border"=>"0", "id"=>"prodImage", "height"=>"240", "alt"=>"Cobblers", "width"=>"240"}

Or, as a one-liner from the shell:

ruby -rrubygems -ropen-uri -e "require 'hpricot';(Hpricot.parse(open('http://www.amazon.com/gp/product/1844300439/ref=amb_cob_bh_194691301/002-0086113-2532879?n=283155')).search(\"//table//td[@id='prodImageCell']\")/:img).each {|link| p link.attributes }"

Amazing stuff, really. The parser is amazingly fast: nearly all of the time is spent fetching the page, not parsing it!
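If you want to check the fetch-versus-parse split yourself, here is a rough sketch using Ruby's standard Benchmark module. The helper name time_fetch_and_parse is made up for illustration and is not part of Hpricot, and the two lambdas are offline stand-ins (a sleep to simulate network latency, a regex scan in place of a real parser) so the snippet runs without a network connection or the gem installed.

```ruby
require 'benchmark'

# Hypothetical helper (not part of Hpricot): times the two stages separately.
# `fetch` is a callable returning raw HTML; `parse` is a callable taking that HTML.
def time_fetch_and_parse(fetch, parse)
  html = nil
  doc = nil
  fetch_seconds = Benchmark.realtime { html = fetch.call }
  parse_seconds = Benchmark.realtime { doc = parse.call(html) }
  { fetch: fetch_seconds, parse: parse_seconds, doc: doc }
end

# Stand-ins so the sketch runs offline. The sleep simulates network latency;
# the regex scan is only a placeholder for a real parser.
fake_fetch = lambda do
  sleep 0.05
  "<table><tr><td id='prodImageCell'><img src='cover.jpg' alt='Cobblers'></td></tr></table>"
end
fake_parse = lambda { |html| html.scan(/<img[^>]*>/) }

timings = time_fetch_and_parse(fake_fetch, fake_parse)
printf("fetch: %.3fs, parse: %.3fs\n", timings[:fetch], timings[:parse])
```

To measure the real thing, you would presumably swap in a lambda that calls open(url).read for the fetch and one that calls Hpricot.parse(html) for the parse. The numbers will depend on your connection and the page, so treat a single run as a rough indication only.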

Also, “Sunset, Sunrise” by Razor Ramon is awesome.

Comments

  • Jaime Iniesta
    August 13, 2007 @ 05:22 PM

    Yes, Hpricot is great. I’ve tried it for a while locally and would like to use it on my web apps, but it’s hard to set up on Dreamhost as the gem is not installed there. Any clues?

  • Hank
    August 13, 2007 @ 09:47 PM
