Java:
Parsing HTML

How to:

Let’s use Jsoup, a handy library for working with real-world HTML. First, add the dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>
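If the project uses Gradle instead of Maven, the equivalent declaration (same coordinates and version as above) is:

```groovy
// build.gradle -- same artifact as the Maven snippet
implementation 'org.jsoup:jsoup:1.15.2'
```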

Now to the fun part. Here’s how to grab a webpage’s title and print it:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class HtmlParser {
    public static void main(String[] args) throws IOException {
        String url = "http://example.com";
        Document doc = Jsoup.connect(url).get();
        String title = doc.title();
        System.out.println("Title: " + title);
    }
}

Output:

Title: Example Domain
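You don't need a network call to try Jsoup: `Jsoup.parse` accepts a plain HTML string. A minimal sketch (the class name and sample markup here are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StringParser {
    public static void main(String[] args) {
        // Parse HTML from an in-memory string -- no HTTP request involved
        String html = "<html><head><title>Offline Test</title></head>"
                    + "<body><p>Hello</p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println("Title: " + doc.title()); // prints "Title: Offline Test"
    }
}
```

This is handy in unit tests, where you want deterministic input instead of a live webpage.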

How about extracting all the links?

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// ... inside main, reusing the doc fetched in the previous example
Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println("Link: " + link.attr("href"));
}
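Putting the pieces together, here is a self-contained version of the link extraction that runs against an in-memory document (the class name and sample links are invented for the sketch):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkExtractor {
    public static void main(String[] args) {
        // A small hard-coded page so the example needs no network connection
        String html = "<body>"
                    + "<a href=\"https://example.com/a\">First</a>"
                    + "<a href=\"https://example.com/b\">Second</a>"
                    + "</body>";
        Document doc = Jsoup.parse(html);
        // CSS selector: every anchor that actually has an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // text() reads the visible label, attr() reads the attribute value
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

The same `select` call works identically on a `Document` fetched with `Jsoup.connect(url).get()`.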

Deep Dive

Once upon a time, HTML was tamed by regex patterns, a method both error-prone and nightmarish for complex documents. Enter Jsoup in the late aughts, providing a jQuery-like interface for Java to parse, traverse, and manipulate HTML.

Jsoup is not the only choice. There’s HtmlUnit for full-fledged web app testing with JavaScript support, but it’s heavier and more complex. For lightweight tasks, Apache Commons Validator can check that extracted URLs are well-formed, though it doesn’t parse HTML itself.

Under the hood, Jsoup uses a DOM parser, which models the entire document in memory as a tree. This approach makes selecting and navigating the HTML structure a breeze. Plus, it’s forgiving with sloppy HTML, fixing issues on-the-fly to ensure robust parsing.
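That forgiveness is easy to demonstrate. A quick sketch: feed Jsoup unclosed tags with no `<html>` or `<body>` wrapper, and it normalizes the document into a well-formed tree:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SloppyHtml {
    public static void main(String[] args) {
        // Two unclosed <p> tags, no <html>/<head>/<body> wrapper
        Document doc = Jsoup.parse("<p>First<p>Second");
        // Jsoup closes the tags and supplies the missing structure,
        // so this prints two properly closed <p> elements
        System.out.println(doc.body().html());
    }
}
```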

Remember, when scraping, always check a site’s robots.txt and terms of service to avoid legal troubles or getting IP-banned.

See Also