Parsing HTML

How to:

Elixir's standard library doesn't include an HTML parser. The usual choice is the third-party library Floki, which lets you query a parsed document with CSS selectors and fits naturally with Elixir's pattern matching and pipe operator.

First, add Floki to your mix.exs dependencies:

defp deps do
  [
    {:floki, "~> 0.31.0"}
  ]
end

Then, run mix deps.get to install the new dependency.

Now, let’s parse a simple HTML string to extract data. We’ll look for the titles inside <h1> tags:

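html_content = """
    <h1>Hello, Elixir!</h1>
    <h1>Another Title</h1>
"""

# Parse the string into a tree first; passing a raw string to Floki.find/2 is deprecated
{:ok, document} = Floki.parse_document(html_content)

titles =
  document
  |> Floki.find("h1")
  # Map over the matches: Floki.text/1 on the whole list would return one joined string
  |> Enum.map(&Floki.text/1)
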

Sample Output:

["Hello, Elixir!", "Another Title"]
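Floki represents every parsed element as a three-element tuple of tag name, attribute list, and children, which is what makes pattern matching on nodes possible. The sketch below inspects that shape for a single element. The `Mix.install/1` line and the `class="title"` attribute are assumptions added so the snippet can run as a standalone script; they are not part of the example above.

```elixir
# Standalone-script setup (an assumption); skip this inside a Mix project
# that already lists Floki in mix.exs.
Mix.install([{:floki, "~> 0.31"}])

{:ok, tree} = Floki.parse_document(~s(<h1 class="title">Hello, Elixir!</h1>))

# The document is a list of nodes; each element is {tag_name, attributes, children}
[{tag, attrs, children}] = tree

IO.inspect(tag)       # the tag name as a string
IO.inspect(attrs)     # a list of {name, value} tuples
IO.inspect(children)  # nested nodes and text strings
```

Because attributes are plain `{name, value}` tuples, functions such as `List.keyfind/3` work on them directly.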

To dive deeper, say you want to extract links (<a> tags) alongside their href attributes. Here’s how you can achieve that:

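html_content = """
    <a href="https://elixir-lang.org/">Elixir's Official Website</a>
    <a href="https://hexdocs.pm/">HexDocs</a>
"""

{:ok, document} = Floki.parse_document(html_content)

links =
  document
  |> Floki.find("a")
  # Each node is a {tag, attributes, children} tuple; this match assumes a single text child
  |> Enum.map(fn {_tag, attrs, [text]} -> {text, List.keyfind(attrs, "href", 0)} end)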

Sample Output:

[{"Elixir's Official Website", {"href", "https://elixir-lang.org/"}}, {"HexDocs", {"href", "https://hexdocs.pm/"}}]
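The `[text]` pattern above only matches links whose sole child is a text node, so a link containing nested markup, such as `<strong>`, would raise a `FunctionClauseError`. A more tolerant variant might look like the sketch below, which uses `Floki.attribute/3` when only the URLs are needed and `Floki.text/1` to gather text from nested children. The sample markup and the `Mix.install/1` line are assumptions added to keep the snippet self-contained.

```elixir
# Standalone-script setup (an assumption); skip this inside a Mix project
# that already lists Floki in mix.exs.
Mix.install([{:floki, "~> 0.31"}])

html_content = """
<a href="https://elixir-lang.org/"><strong>Elixir</strong></a>
<a href="https://hexdocs.pm/">HexDocs</a>
<a>No href here</a>
"""

{:ok, document} = Floki.parse_document(html_content)

# Only the href values; elements without the attribute are skipped
hrefs = Floki.attribute(document, "a", "href")

# {text, href} pairs, with nil when the attribute is missing
links =
  document
  |> Floki.find("a")
  |> Enum.map(fn {_tag, attrs, _children} = link ->
    href =
      case List.keyfind(attrs, "href", 0) do
        {"href", value} -> value
        nil -> nil
      end

    # Floki.text/1 collects text from nested nodes as well
    {Floki.text(link), href}
  end)

IO.inspect(hrefs)
IO.inspect(links)
```

Returning the bare URL instead of the `{"href", url}` tuple is a design choice made here to keep the result easier to consume.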

The same three steps (parse the document once, select nodes with Floki.find/2, then map over the matches) cover most HTML extraction tasks in Elixir applications.