Parsing HTML

How to:

To parse HTML in Ruby, install the ‘Nokogiri’ gem by running gem install nokogiri. Nokogiri is like a Swiss Army knife for working with HTML and XML in Ruby. Here’s a quick example that fetches a page and prints its title:

require 'nokogiri'
require 'open-uri'

# Load HTML content from a website
html_content = URI.open('http://example.com').read

# Parse the HTML
doc = Nokogiri::HTML(html_content)

# Extract the title
title = doc.xpath('//title').text
puts "The title of the page is: #{title}"

This spits out something like: The title of the page is: Example Domain.

Deep Dive

Back in the early Ruby days, options for parsing HTML were limited. REXML shipped with Ruby, but it was slow and designed for well-formed XML rather than messy real-world HTML. Then Hpricot showed up, but it was eventually discontinued. Nokogiri debuted in 2008, blending the ease of Hpricot with the speed and power of libxml2, a proven C library for parsing XML and HTML.

In the parsing world, there are always alternatives. Some swear by ‘rexml’ (a bundled gem since Ruby 3.0, and best suited to well-formed XML) or ‘oga’, another XML/HTML parser for Ruby. But Nokogiri remains a favorite for its speed, its tolerance of malformed markup, and its support for both XPath and CSS selectors.

Under the hood, Nokogiri parses HTML into a Document Object Model (DOM), a tree of nodes in which every element, attribute, and piece of text has a place. This makes it easy to navigate and manipulate elements. Using XPath and CSS selectors, you can pinpoint any piece of information you need.
