Ruby HTML Parsers

Sometimes you need to crawl a web page and extract specific data from it. It doesn’t have to be for “evil” purposes like scraping competitors’ data. For example, you may want to check your own web pages periodically for unreachable links: extract all links on the page and see whether they’re still accessible.

When we need to extract specific parts of some text, regular expressions tend to be the first tool that comes to mind. For example, a regular expression for extracting links might look like this:

/<a.*?href=".*?">/im

This regular expression is too simple and can easily cause problems. The quotes may be single or double, or they may be omitted entirely. There can be spaces on both sides of the equals sign. Other attributes may follow the href. Anticipating all of these “special” cases is difficult.
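To make these failure modes concrete, here is a small sketch that runs the pattern above against a few equivalent ways of writing a link. The capture group around the URL and the sample markup are my own additions for illustration, not part of the original pattern.

```ruby
# The naive pattern from above, with a capture group added around the URL
# (the group is an addition for this demo).
NAIVE_LINK = /<a.*?href="(.*?)">/im

# Hypothetical sample markup; every entry is a legitimate way to write a link.
SAMPLES = [
  '<a href="http://example.com/double">double quotes</a>',
  "<a href='http://example.com/single'>single quotes</a>",
  '<a href=http://example.com/bare>no quotes</a>',
  '<a href = "http://example.com/spaced">spaces around =</a>',
  '<a href="http://example.com/extra" class="nav">extra attribute</a>'
].freeze

# Returns whatever the naive pattern captures as "URLs" in the given HTML.
def naive_links(html)
  html.scan(NAIVE_LINK).flatten
end

SAMPLES.each do |html|
  puts "#{html} => #{naive_links(html).inspect}"
end
```

Only the first sample yields a clean URL. The single-quoted, unquoted and spaced variants are missed completely, and in the last sample the capture runs on into the class attribute, because the pattern only stops at the first `">` it finds.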

Of course, the better approach is to build a DOM tree from the HTML document, as we do in JavaScript. Getting all links in JavaScript is easy: document.getElementsByTagName("a"). For more complex situations, there are many handy frameworks such as jQuery and Prototype. Although we normally don’t use JavaScript outside a browser, other languages have libraries for parsing HTML documents too. (If everyone wrote HTML as well-formed XML, we could just use XML parsers :))

I recently searched for such parsers in Ruby and found two: Hpricot and Nokogiri. Sadly, I couldn’t get Hpricot to work correctly on my Mac, but Nokogiri did work. I have been using jQuery extensively in web development, so I love the CSS selectors in Nokogiri very much. It also provides search methods based on XPath.

Of course, if you just want to extract the title of an HTML document, a regular expression is convenient enough and often faster than parsing the entire document into a tree. But in more complex situations, you should consider an HTML parser like Nokogiri, because it is more robust and safer.
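For the simple title case, a regular expression along these lines is usually enough. This is a minimal sketch: the helper name extract_title and the pattern are my own, and it makes no attempt to decode entities or to ignore a title that appears inside a comment.

```ruby
# Hypothetical pattern: the contents of the first <title> element.
# "i" ignores case (<TITLE>), "m" lets "." match newlines in wrapped titles.
TITLE_PATTERN = %r{<title[^>]*>(.*?)</title>}im

# Returns the page title as a single-line string, or nil if none is found.
def extract_title(html)
  match = TITLE_PATTERN.match(html)
  return nil unless match

  # Collapse runs of whitespace, since titles are often wrapped across lines.
  match[1].strip.gsub(/\s+/, " ")
end

page = "<html><head><TITLE>\n  Ruby HTML\n  Parsers </TITLE></head><body></body></html>"
puts extract_title(page)
puts extract_title("<p>no title</p>").inspect
```

This works because the title is a single element with plain text content. As soon as attributes or nesting matter, the parser is the safer choice.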

And it’s interesting to see that Nokogiri and Hpricot are competing on performance…
