Parsing HTML

Go:
Parsing HTML

How to:

To parse HTML in Go, you typically use the goquery package or the standard library’s net/html package. Here’s a basic example using net/html to extract all links from a webpage:

package main

import (
    "fmt"
    "golang.org/x/net/html"
    "net/http"
)

func main() {
    // Get HTML document
    res, err := http.Get("http://example.com")
    if err != nil {
        panic(err)
    }
    defer res.Body.Close()

    // Parse the HTML document
    doc, err := html.Parse(res.Body)
    if err != nil {
        panic(err)
    }

    // Function to recursively traverse the DOM
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    fmt.Println(a.Val)
                    break
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }

    // Traverse the DOM
    f(doc)
}

Sample output (assuming http://example.com contains two links):

http://www.iana.org/domains/example
http://www.iana.org/domains/reserved

This code requests an HTML page, parses it, and recursively traverses the DOM to find and print href attributes of all <a> tags.

Deep Dive

The net/html package provides the basics for parsing HTML in Go, directly implementing the tokenization and tree construction algorithms specified by the HTML5 standard. This low-level approach is powerful but can be verbose for complex tasks.

In contrast, the third-party goquery package, inspired by jQuery, offers a higher-level interface that simplifies DOM manipulation and traversal. It allows developers to write concise and expressive code for tasks like element selection, attribute extraction, and content manipulation.

However, goquery’s convenience comes at the cost of an additional dependency and potentially slower performance due to its abstraction layer. The choice between net/html and goquery (or other parsing libraries) depends on the specific requirements of the project, such as the need for performance optimization or ease of use.

Historically, HTML parsing in Go has evolved from basic string operations to sophisticated DOM tree manipulation, reflecting the language’s growing ecosystem and the community’s demand for robust web scraping and data extraction tools. Despite native capabilities, the prevalence of third-party libraries like goquery highlights the Go community’s preference for modular, reusable code. However, for performance-critical applications, programmers might still favor the net/html package or even resort to regex for simple parsing tasks, keeping in mind the inherent risks and limitations of regex-based HTML parsing.

Last updated on March 13, 2024

Downloading a web page Sending an HTTP request

Go:Parsing HTML

How to:

Deep Dive

Go:
Parsing HTML