Haskell: Parsing HTML

How to:

To parse HTML in Haskell, we’ll use the tagsoup library for its simplicity and flexibility. First, install it: add tagsoup to the build-depends field of your project’s .cabal file, or run cabal install --lib tagsoup (recent versions of cabal need the --lib flag to install a library rather than an executable).

import Text.HTML.TagSoup (fromAttrib, isTagOpenName, parseTags)

-- Sample HTML for demonstration
sampleHtml :: String
sampleHtml = "<html><body><p>Learn Haskell!</p><a href='http://example.com'>Click Here</a></body></html>"

main :: IO ()
main = do
  -- Parse the HTML into a flat list of tags
  let tags = parseTags sampleHtml
      -- Keep only opening <a> tags and read their href attribute
      links = [fromAttrib "href" tag | tag <- tags, isTagOpenName "a" tag]
  -- Print extracted links
  print links

Sample output:

["http://example.com"]
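
The same approach extends to pulling out a link's visible text along with its URL. The following is a minimal sketch, assuming each link contains only plain text; the helper name extractLinks is hypothetical and not part of tagsoup. It uses sections to find each opening a tag together with the tags that follow it, then innerText to collect the text up to the matching close tag.

```haskell
import Text.HTML.TagSoup
  (fromAttrib, innerText, isTagCloseName, isTagOpenName, parseTags, sections)

-- | Pair each link's href with its visible text.
-- (extractLinks is an illustrative helper, not a tagsoup function.)
extractLinks :: String -> [(String, String)]
extractLinks html =
  [ (fromAttrib "href" open, innerText (takeWhile (not . isTagCloseName "a") rest))
  | open : rest <- sections (isTagOpenName "a") (parseTags html)
  ]

main :: IO ()
main =
  print (extractLinks "<html><body><p>Learn Haskell!</p><a href='http://example.com'>Click Here</a></body></html>")
  -- prints [("http://example.com","Click Here")]
```

Because tagsoup does not build a tree, nested markup inside a link is handled here only by concatenating its text nodes; malformed input with no closing a tag would pull in all following text.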

If HTML is only one of several document formats you work with, consider the pandoc library, which converts between many markup formats. It is far more versatile than tagsoup but also more complex to use. The following snippet renders a Pandoc document to an HTML string:

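import Text.Pandoc

-- Assuming you have a Pandoc document (doc) loaded, e.g., from reading a file
let doc = ... -- Your Pandoc document goes here

-- Convert the document to HTML. In pandoc 2.0 and later, writers run in a
-- PandocMonad and return Text; older releases used `writeHtmlString def doc`.
htmlText <- runIOorExplode (writeHtml5String def doc)

-- Now parse `htmlText` as above (tagsoup's parseTags also accepts Text) or proceed as per your requirements.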

Keep in mind that pandoc is a much larger library focusing on conversion between numerous markup formats, so use it if you need those extra capabilities or if you’re already dealing with document formats in your application.
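
Pandoc can also read HTML, which is the direction that matters for parsing. The sketch below assumes pandoc 2.x or later and the pandoc-types package (for Text.Pandoc.Walk); the helper name linkTargets is hypothetical. It parses an HTML fragment into pandoc's document type with readHtml and collects the target URL of every link with query.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Text.Pandoc (Inline (Link), def, readHtml, runPure)
import Text.Pandoc.Walk (query)

-- | Parse HTML with pandoc and collect every link target.
-- (linkTargets is an illustrative helper, not part of pandoc.)
linkTargets :: Text -> Either String [Text]
linkTargets html =
  case runPure (readHtml def html) of
    Left err  -> Left (show err)            -- parsing failed
    Right doc -> Right (query collect doc)  -- walk the document tree
  where
    collect :: Inline -> [Text]
    collect (Link _ _ (url, _)) = [url]
    collect _                   = []

main :: IO ()
main =
  print (linkTargets "<p>Learn Haskell!</p><p><a href='http://example.com'>Click Here</a></p>")
```

Unlike tagsoup's flat tag list, this works on pandoc's structured document, so it only sees what pandoc's HTML reader chooses to keep; markup with no equivalent in pandoc's document model may be dropped.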

Last updated on March 13, 2024
