Java:
Parsing HTML

How to:

Let’s use Jsoup, a Java library for HTML parsing. Add it to your project with Maven:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Here’s a snippet that fetches a webpage and grabs the title:

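import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlParserDemo {
    public static void main(String[] args) throws IOException {
        // Fetch the page over HTTP and parse the response into a DOM tree.
        Document doc = Jsoup.connect("http://example.com").get();
        // title() returns the text of the page's <title> element.
        String title = doc.title();
        System.out.println("Title: " + title);
    }
}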

Sample output:

Title: Example Domain

Deep Dive:

Historically, parsing HTML was hard because real-world pages are often malformed and older standards left error handling largely undefined. Jsoup makes the job much easier by providing a forgiving parser and a jQuery-like interface built on CSS selectors.
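To make the jQuery-like interface concrete, here is a minimal sketch that selects elements with a CSS selector. The class name SelectorDemo, the helper navLinks, and the sample markup are made up for illustration; Jsoup.parse, select, and attr are the Jsoup calls doing the work.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Hypothetical helper: collects the href of every link inside a <nav> element.
    static List<String> navLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        // "nav a[href]" matches <a> elements with an href attribute nested in <nav>.
        for (Element link : doc.select("nav a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<nav><a href=\"/home\">Home</a><a href=\"/docs\">Docs</a></nav>"
                    + "<p><a href=\"/other\">Other</a></p>";
        System.out.println(navLinks(html));
    }
}
```

With the sample markup above, this should print [/home, /docs]; the link outside the nav element is not matched.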

Alternatives to Jsoup include HtmlUnit, a headless browser aimed at testing web apps, and XPath, a query language for XML-like structures that works best on well-formed markup.
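For comparison, here is a rough sketch of the XPath route using only the JDK's built-in XML APIs. It assumes the input is well-formed XML (for example XHTML); the JDK parser is strict and will reject typical messy HTML. The class name XPathDemo and the helper firstTitle are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import java.io.StringReader;

public class XPathDemo {
    // Hypothetical helper: returns the text of /html/head/title from well-formed markup.
    static String firstTitle(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // evaluate(expression, item) returns the string value of the first match.
            return XPathFactory.newInstance().newXPath().evaluate("/html/head/title", doc);
        } catch (Exception e) {
            // Malformed input ends up here, unlike with a forgiving HTML parser.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>Example Domain</title></head><body/></html>";
        System.out.println("Title: " + firstTitle(xml));
    }
}
```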

Implementation-wise, Jsoup uses a DOM (Document Object Model) approach, building an in-memory tree representation of the HTML. It can fix bad HTML on the fly, making it resilient to real-world web pages.
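The sketch below shows this repair behavior on a deliberately broken fragment with unclosed tags. The class name RepairDemo, the helper parseMessy, and the input string are invented for the example; the exact serialized output can differ between Jsoup versions, so it is printed rather than quoted here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RepairDemo {
    // Hypothetical helper: parses possibly malformed HTML into a full document tree.
    static Document parseMessy(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // No <html>, <head>, or <body>, and none of the tags are closed.
        String messy = "<p>First<p>Second <b>bold";
        Document doc = parseMessy(messy);

        // The second <p> implicitly closes the first, so the tree holds two paragraphs.
        System.out.println("Paragraphs: " + doc.select("p").size());
        // The unclosed <b> is still a normal element in the tree.
        System.out.println("Bold text: " + doc.select("b").text());
        // The repaired markup, now wrapped in a body and with closing tags added.
        System.out.println(doc.body().html());
    }
}
```

Once parsed, the tree can be queried exactly like one built from clean markup, which is why the same code tends to work on real-world pages.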

See Also: