Haskell:
Parsing HTML
How to:
For parsing HTML in Haskell, we’ll use the tagsoup library for its simplicity and flexibility. First, make sure to install the library by adding tagsoup to the build-depends field of your project’s cabal file or by running cabal install tagsoup.
import Text.HTML.TagSoup

main :: IO ()
main = do
  -- Sample HTML for demonstration (a plain String)
  let sampleHtml = "<html><body><p>Learn Haskell!</p><a href='http://example.com'>Click Here</a></body></html>"
  -- Parse the HTML into a flat list of tags
  let tags = parseTags sampleHtml
  -- Keep only opening <a> tags and read their href attribute
  let links = [fromAttrib "href" tag | tag <- tags, isTagOpenName "a" tag]
  -- Print extracted links
  print links
Sample output:
["http://example.com"]
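Since tagsoup returns a flat list of tags rather than a tree, pulling out the text inside an element means scanning forward from its opening tag. The following is a minimal sketch of that idea, pairing each link's URL with its visible text. The names linkPairs and sampleHtml are hypothetical helpers introduced for illustration; sections, innerText, isTagOpenName, isTagCloseName and fromAttrib come from Text.HTML.TagSoup.

```haskell
import Text.HTML.TagSoup

-- Same sample document as in the example above
sampleHtml :: String
sampleHtml = "<html><body><p>Learn Haskell!</p><a href='http://example.com'>Click Here</a></body></html>"

-- Hypothetical helper: pair each link's href with the text inside the <a> element.
-- `sections` yields every suffix of the tag list that starts at an opening <a> tag;
-- we then read tags up to the matching </a> and concatenate their text.
linkPairs :: String -> [(String, String)]
linkPairs html =
  [ (fromAttrib "href" open, innerText (takeWhile (not . isTagCloseName "a") rest))
  | (open : rest) <- sections (isTagOpenName "a") (parseTags html)
  ]

main :: IO ()
main = print (linkPairs sampleHtml)
-- prints [("http://example.com","Click Here")]
```

This approach assumes links are not nested inside one another; for deeply nested or irregular markup, a tree-based parser may be easier to work with.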
For more sophisticated HTML parsing needs, consider using the pandoc library, especially if you’re working with document conversion. It’s exceptionally versatile but comes with more complexity:
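import Text.Pandoc
import qualified Data.Text as T
-- The statements below are meant to run inside an IO do block.
-- Assuming you have a Pandoc document (doc) loaded, e.g., from reading a file
let doc = ... -- Your Pandoc document goes here
-- Convert the document to HTML. In pandoc 2.0 and later, writeHtml5String replaces
-- the older writeHtmlString; it runs in a PandocMonad and returns Text.
-- runPure evaluates it without IO, and handleError reports any PandocError and exits.
htmlText <- handleError (runPure (writeHtml5String def doc))
let htmlString = T.unpack htmlText
-- Now, you would parse `htmlString` as above or proceed as per your requirements.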
Keep in mind that pandoc is a much larger library focusing on conversion between numerous markup formats, so use it if you need those extra capabilities or if you’re already dealing with document formats in your application.
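If you do go the pandoc route, it can also read HTML directly into its document tree, which you can then query. The sketch below assumes pandoc 2.8 or later (where link targets are Text) and the pandoc-types package for Text.Pandoc.Walk. The names linkTargets and sampleHtml are hypothetical, introduced only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Pandoc (Inline (Link), Pandoc, def, readHtml, runPure)
import Text.Pandoc.Walk (query)

-- Same sample document as in the tagsoup example, as Text
sampleHtml :: Text
sampleHtml = "<html><body><p>Learn Haskell!</p><a href='http://example.com'>Click Here</a></body></html>"

-- Hypothetical helper: collect the URL of every Link node in the document tree.
linkTargets :: Pandoc -> [Text]
linkTargets = query go
  where
    go :: Inline -> [Text]
    go (Link _ _ (url, _)) = [url]
    go _ = []

main :: IO ()
main =
  -- readHtml runs in a PandocMonad; runPure evaluates it without touching the file system.
  case runPure (readHtml def sampleHtml) of
    Left err -> putStrLn ("Could not parse HTML: " ++ show err)
    Right doc -> print (linkTargets doc)
```

Compared with the tagsoup version, this works on a structured tree, so nested markup is handled by the traversal rather than by manual scanning, at the cost of a much heavier dependency.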