Go:
Parsing HTML
How to:
To parse HTML in Go, you typically use the goquery
package or the standard library’s net/html
package. Here’s a basic example using net/html
to extract all links from a webpage:
package main
import (
"fmt"
"golang.org/x/net/html"
"net/http"
)
func main() {
// Get HTML document
res, err := http.Get("http://example.com")
if err != nil {
panic(err)
}
defer res.Body.Close()
// Parse the HTML document
doc, err := html.Parse(res.Body)
if err != nil {
panic(err)
}
// Function to recursively traverse the DOM
var f func(*html.Node)
f = func(n *html.Node) {
if n.Type == html.ElementNode && n.Data == "a" {
for _, a := range n.Attr {
if a.Key == "href" {
fmt.Println(a.Val)
break
}
}
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
}
// Traverse the DOM
f(doc)
}
Sample output (assuming http://example.com
contains two links):
http://www.iana.org/domains/example
http://www.iana.org/domains/reserved
This code requests an HTML page, parses it, and recursively traverses the DOM to find and print href
attributes of all <a>
tags.
Deep Dive
The net/html
package provides the basics for parsing HTML in Go, directly implementing the tokenization and tree construction algorithms specified by the HTML5 standard. This low-level approach is powerful but can be verbose for complex tasks.
In contrast, the third-party goquery
package, inspired by jQuery, offers a higher-level interface that simplifies DOM manipulation and traversal. It allows developers to write concise and expressive code for tasks like element selection, attribute extraction, and content manipulation.
However, goquery
’s convenience comes at the cost of an additional dependency and potentially slower performance due to its abstraction layer. The choice between net/html
and goquery
(or other parsing libraries) depends on the specific requirements of the project, such as the need for performance optimization or ease of use.
Historically, HTML parsing in Go has evolved from basic string operations to sophisticated DOM tree manipulation, reflecting the language’s growing ecosystem and the community’s demand for robust web scraping and data extraction tools. Despite native capabilities, the prevalence of third-party libraries like goquery
highlights the Go community’s preference for modular, reusable code. However, for performance-critical applications, programmers might still favor the net/html
package or even resort to regex for simple parsing tasks, keeping in mind the inherent risks and limitations of regex-based HTML parsing.