Java:
Parsing HTML
How to:
Let’s use Jsoup, a Java library for HTML parsing. Add it to your project with Maven:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
Here’s a snippet that fetches a webpage and grabs the title:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlParserDemo {
    public static void main(String[] args) throws IOException {
        // Fetch the page over HTTP and parse it into a Document tree.
        Document doc = Jsoup.connect("http://example.com").get();
        // Read the text of the <title> element.
        String title = doc.title();
        System.out.println("Title: " + title);
    }
}
Sample output:
Title: Example Domain
Deep Dive:
Parsing HTML has long been difficult because real-world markup is often malformed and, before HTML5, error handling was not standardized across browsers. Jsoup eases this by pairing a forgiving parser with a jQuery-like interface that selects elements using CSS selectors.
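To make the jQuery-like interface concrete, here is a minimal sketch that parses an HTML string and selects elements with a CSS selector. The class name `SelectorDemo`, the helper `navLinks`, and the sample markup are hypothetical and chosen only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Hypothetical helper: returns the href of every link nested inside a <nav> element.
    static List<String> navLinks(String html) {
        Document doc = Jsoup.parse(html);
        // CSS selector: <a> elements that have an href attribute and sit under <nav>.
        Elements links = doc.select("nav a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Illustrative markup, not taken from a real page.
        String html = "<nav><a href='/home'>Home</a><a href='/docs'>Docs</a></nav>"
                + "<p><a href='/other'>Other</a></p>";
        System.out.println(navLinks(html)); // prints [/home, /docs]
    }
}
```

The link inside the paragraph is skipped because the selector only matches anchors that are descendants of `nav`.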
Alternatives to Jsoup include HtmlUnit, a headless browser suited to testing web apps, and XPath, a query language for XML-like structures.
Implementation-wise, Jsoup uses a DOM (Document Object Model) approach, building an in-memory tree representation of the HTML. It can fix bad HTML on the fly, making it resilient to real-world web pages.
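The tree-building and repair behaviour can be observed with a small sketch. The class `TreeRepairDemo`, the helper `paragraphTexts`, and the deliberately broken input are assumptions made for this example, not code from the library's documentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TreeRepairDemo {
    // Hypothetical helper: parses the input and returns the text of each <p> element in order.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Deliberately broken input: no <html> or <body>, and unclosed <p> and <b> tags.
        String messy = "<p>First<p>Second <b>bold";
        Document doc = Jsoup.parse(messy);

        // The parser supplies the missing skeleton and closes the open elements.
        System.out.println(doc.select("p").size());           // prints 2
        System.out.println(paragraphTexts(messy));            // prints [First, Second bold]
        System.out.println(doc.select("b").first().parent().tagName()); // prints p
    }
}
```

Because the result is an ordinary in-memory tree, the repaired document can be traversed or queried the same way as one parsed from well-formed markup.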