Lua:
Parsing HTML
How to:
Lua does not have a built-in library for parsing HTML, but you can utilize third-party libraries like LuaHTML or leverage bindings for libxml2 through LuaXML. A popular approach is to use the lua-gumbo library for parsing HTML, which provides a straightforward, HTML5-compliant parsing capability.
Installing lua-gumbo:
First, ensure lua-gumbo is installed. You can typically install it using luarocks, where the rock is published under the name gumbo:
luarocks install gumbo
Basic Parsing with lua-gumbo:
Here’s how you can parse a simple HTML snippet and extract data from it using lua-gumbo:
local gumbo = require "gumbo"
local document = gumbo.parse[[<html><body><p>Hello, world!</p></body></html>]]
local p = document:getElementsByTagName("p")[1]
print(p.textContent) -- Output: Hello, world!
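The lookup above assumes the document contains at least one matching element. As a minimal sketch, the helper below returns nil instead of raising an error when no element is found. The name first_text is hypothetical and not part of lua-gumbo, and the sketch assumes that indexing an empty result from getElementsByTagName yields nil.

```lua
-- Return the text of the first element with the given tag, or nil if none exists.
-- `first_text` is a hypothetical helper name, not part of lua-gumbo.
local function first_text(document, tag)
  local element = document:getElementsByTagName(tag)[1]
  if element == nil then
    return nil
  end
  return element.textContent
end

-- Usage, guarded with pcall so the script still loads when lua-gumbo is missing.
local ok, gumbo = pcall(require, "gumbo")
if ok then
  local document = gumbo.parse("<html><body><p>Hello, world!</p></body></html>")
  print(first_text(document, "p"))  -- text of the first <p>
  print(first_text(document, "h1")) -- no <h1> in this snippet
end
```

Returning nil keeps the calling code in control of how a missing element is handled, rather than failing on a field access.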
Advanced Example - Extracting Links:
To extract href attributes from all anchor tags (<a> elements) in an HTML document:
local gumbo = require "gumbo"
local document = gumbo.parse([[
<html>
<head><title>Sample Page</title></head>
<body>
<a href="http://example.com/1">Link 1</a>
<a href="http://example.com/2">Link 2</a>
<a href="http://example.com/3">Link 3</a>
</body>
</html>
]])
for _, element in ipairs(document.links) do
  if element.getAttribute then -- Ensure it's an Element and has attributes
    local href = element:getAttribute("href")
    if href then print(href) end
  end
end
-- Sample Output:
-- http://example.com/1
-- http://example.com/2
-- http://example.com/3
This code snippet iterates through all the links in the document and prints their href attributes. Because lua-gumbo parses the HTML into a structured document, extracting specific elements by tag or attribute takes only a few lines.
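The same pattern can be generalized to other tags and attributes. The sketch below uses only the calls already shown (getElementsByTagName and getAttribute). The helper name collect_attribute is hypothetical, and the code assumes the collection returned by getElementsByTagName can be iterated with ipairs in the same way as document.links.

```lua
-- Collect the values of `attr` from every `tag` element that has that attribute.
-- `collect_attribute` is a hypothetical helper name, not part of lua-gumbo.
local function collect_attribute(document, tag, attr)
  local values = {}
  for _, element in ipairs(document:getElementsByTagName(tag)) do
    local value = element:getAttribute(attr)
    if value then
      values[#values + 1] = value
    end
  end
  return values
end

-- Usage, guarded with pcall so the script still loads when lua-gumbo is missing.
local ok, gumbo = pcall(require, "gumbo")
if ok then
  local document = gumbo.parse([[
    <html><body>
      <a href="http://example.com/1">Link 1</a>
      <a>No href here</a>
    </body></html>
  ]])
  for _, href in ipairs(collect_attribute(document, "a", "href")) do
    print(href)
  end
end
```

Returning a table rather than printing directly lets the caller filter, sort, or deduplicate the values before using them.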