Sometimes you need to crawl a web page and extract specific data from it. It doesn't have to be anything “evil” like scraping competitors' data. For example, you may want to periodically check your own pages for unreachable links: extract all the links on a page and see whether they're still accessible.
For extracting specific parts of some text, we tend to reach for regular expressions first. For example, a regular expression for extracting links might look like:
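The original snippet isn't shown here, but a typical naive pattern, sketched in Ruby with a made-up sample string, might be something like:

```ruby
html = '<p>Visit <a href="https://example.com">Example</a> today.</p>'

# Naive approach: capture whatever sits inside href="..."
# (assumes double quotes and no extra whitespace around the
# equals sign -- exactly the fragile assumptions discussed below)
links = html.scan(/<a\s+href="([^"]*)"/).flatten
puts links  # => https://example.com
```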
This regular expression is too simple, and it can easily break. The quotes may be single or double, or they may be omitted entirely. There can be spaces on either side of the equals sign. Anticipating all these “special” cases is difficult.
A real HTML parser handles these cases for us. I recently searched for HTML parsers in Ruby and found two: Hpricot and Nokogiri. Sadly I couldn't get Hpricot to work correctly on my Mac, but Nokogiri did work. I have been using jQuery extensively in web development, so I especially like Nokogiri's CSS selectors. It also supports searching by XPath.
Of course, if you just want to extract the title of an HTML document, a regular expression is convenient enough and often faster than parsing the entire document into a tree. But in more complex situations, you should consider an HTML parser like Nokogiri, because it's more robust and safer.
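For instance, a quick title extraction with a regex, a hypothetical one-liner on a made-up document, needs no parse tree at all:

```ruby
html = '<html><head><title>My Page</title></head><body></body></html>'

# String#[] with a regexp and a capture index returns the captured group
title = html[%r{<title>(.*?)</title>}m, 1]
puts title  # => My Page
```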
And it’s interesting to see that Nokogiri and Hpricot are competing on performance…