Parsing HTML

How to:

Bash isn’t the go-to for parsing HTML, but it can be done with text tools like grep, awk, and sed, or with HTML-aware utilities like lynx. For robustness, we’ll use xmllint, which ships with libxml2.

# Install xmllint if necessary (package name on Debian/Ubuntu)
sudo apt-get install libxml2-utils

# Sample HTML
cat > sample.html <<'EOF'
  <title>Sample Page</title>
  <h1>Hello, Bash!</h1>
  <p id="myPara">Bash can read me.</p>
EOF

# Parse the Title
title=$(xmllint --html --xpath '//title/text()' sample.html 2>/dev/null)
echo "The title is: $title"

# Extract Paragraph by ID
para=$(xmllint --html --xpath '//*[@id="myPara"]/text()' sample.html 2>/dev/null)
echo "The paragraph content is: $para"


Output:

The title is: Sample Page
The paragraph content is: Bash can read me.
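
When an XPath expression matches several nodes, the way xmllint separates them in its output has differed between libxml2 releases, so a cautious script can ask for one node at a time. The following is a minimal sketch of that idea, not the only way to do it. The file list.html and its markup are made up for illustration; count() and string() are standard XPath 1.0 functions.

```bash
#!/usr/bin/env bash
# Hypothetical input file, not part of the sample above.
cat > list.html <<'EOF'
<ul>
  <li>alpha</li>
  <li>beta</li>
  <li>gamma</li>
</ul>
EOF

# count() yields a number, which xmllint prints as plain text.
count=$(xmllint --html --xpath 'count(//li)' list.html 2>/dev/null)

items=()
# ${count:-0} keeps the loop valid if xmllint is missing or fails.
for ((i = 1; i <= ${count:-0}; i++)); do
  # string() returns the text of the i-th match only.
  items+=("$(xmllint --html --xpath "string((//li)[$i])" list.html 2>/dev/null)")
done

for item in "${items[@]}"; do
  echo "item: $item"
done
```

Calling xmllint once per node is slow on large documents, so this approach suits short lists; for heavier work, one of the alternatives under Deep Dive is a better fit.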

Deep Dive

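Back in the day, programmers scanned HTML with regex-based tools like grep, but that was clunky. HTML isn’t a regular language: elements nest, attributes can appear in any order, and a tag can span several lines. Line-oriented tools miss that structure, so a pattern that works on one page can silently fail on the next.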
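
To make the fragility concrete, here is a small sketch using only grep and sed. The helper name regex_title and both inputs are invented for this illustration. The same title is fed in twice: once on a single line, and once wrapped across lines, which is equally valid HTML.

```bash
#!/usr/bin/env bash
# regex_title is a hypothetical helper for this illustration only.
# It expects the whole <title> element to sit on one line.
regex_title() {
  grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g'
}

one_line=$(printf '%s\n' '<title>Sample Page</title>' | regex_title)
wrapped=$(printf '%s\n' '<title>' '  Sample Page' '</title>' | regex_title)

echo "one line: '${one_line}'"   # prints: one line: 'Sample Page'
echo "wrapped:  '${wrapped}'"    # prints: wrapped:  ''
```

grep reads its input line by line, so the wrapped title never matches and the script gets an empty string with no error to signal that anything went wrong.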

Alternatives? Plenty. Python with Beautiful Soup, PHP with DOMDocument, JavaScript with DOM parsers: all of these are languages with libraries designed to understand HTML’s structure.

Using xmllint in bash scripts is solid for simple tasks. It is an XML tool first, so it handles XHTML natively, and the --html flag switches it to libxml2’s more forgiving HTML parser. Real-world HTML can be unpredictable, though, and often breaks XML’s strict rules. xmllint still maps whatever it reads onto an XML-style tree, which works well for well-formed HTML but can stumble on messy markup.
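
Because messy markup can leave a query with nothing to return, a script may want to check the result rather than trust it. The sketch below wraps the xmllint call in a helper; extract_text is a hypothetical name chosen here, not something xmllint provides, and it assumes the sample.html file from the How to section. It relies on the XPath 1.0 functions string() and normalize-space().

```bash
#!/usr/bin/env bash
# extract_text is a hypothetical helper name, not part of xmllint.
# Usage: extract_text XPATH FILE
extract_text() {
  local xpath=$1 file=$2 out
  # string() turns "no match" into an empty string instead of an error;
  # normalize-space() trims and collapses whitespace.
  out=$(xmllint --html --xpath "normalize-space(string($xpath))" "$file" 2>/dev/null) || return 1
  # Treat an empty result as a failure so callers can react to it.
  [ -n "$out" ] || return 1
  printf '%s\n' "$out"
}

if heading=$(extract_text '//h1' sample.html); then
  echo "Heading: $heading"
else
  echo "No <h1> text found in sample.html" >&2
fi
```

Returning a non-zero status on an empty result lets the caller use a plain if statement, which keeps a missing element from flowing through the rest of the script as an empty variable.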

