C++:
Parsing HTML
How to:
C++ doesn’t come with built-in HTML parsing capabilities. You’ll often use a library like Gumbo-parser by Google, or something similar. Here’s a quick example using Gumbo-parser:
#include <iostream>
#include <gumbo.h>
void search_for_links(GumboNode* node) {
if (node->type != GUMBO_NODE_ELEMENT) {
return;
}
if (node->v.element.tag == GUMBO_TAG_A) {
GumboAttribute* href = gumbo_get_attribute(&node->v.element.attributes, "href");
if (href) {
std::cout << href->value << std::endl;
}
}
GumboVector* children = &node->v.element.children;
for (unsigned int i = 0; i < children->length; ++i) {
search_for_links(static_cast<GumboNode*>(children->data[i]));
}
}
int main() {
const char* html = "<html><body><a href='https://example.com'>Link</a></body></html>";
GumboOutput* output = gumbo_parse(html);
search_for_links(output->root);
gumbo_destroy_output(&kGumboDefaultOptions, output);
return 0;
}
Sample output:
https://example.com
Deep Dive
Parsing HTML hasn’t always been straightforward in C++. Historically, programmers would use regex or hand-written parsers, both of which are error-prone and cumbersome. Nowadays, robust libraries like Gumbo-parser handle the intricacies of parsing, making it easier and more reliable.
Alternatives include Tidy, MyHTML, or even integrating C++ with Python’s BeautifulSoup via the C++ system
function or embedded interpreters.
Implementation-wise, these libraries convert HTML to a Document Object Model (DOM) tree. Traversing and manipulating the DOM allows users to extract and work with data as demonstrated in the How to section.