Elixir: Parsing HTML

How to:

Elixir’s standard library does not include an HTML parser. The usual choice is the third-party library Floki, which parses HTML into a tree of {tag, attributes, children} tuples and lets you query that tree with CSS selectors. The results fit naturally with Elixir’s pattern matching and pipe operator.

First, add Floki to your mix.exs dependencies:

defp deps do
  [
    {:floki, "~> 0.31.0"}
  ]
end

Then, run mix deps.get to install the new dependency.
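
For quick experiments outside a Mix project, a single-file script can pull in Floki with Mix.install/1 instead. The sketch below assumes Elixir 1.12 or later and network access to Hex; the file name parse_demo.exs is only illustrative.

```elixir
# parse_demo.exs (hypothetical file name); run with: elixir parse_demo.exs
# Mix.install/1 fetches and compiles Floki on the first run (Elixir 1.12+).
Mix.install([{:floki, "~> 0.31.0"}])

# parse_document/1 returns {:ok, tree}; the tree is a list of
# {tag, attributes, children} tuples.
{:ok, document} = Floki.parse_document("<p>Hello</p>")

paragraphs = Floki.find(document, "p")
IO.inspect(paragraphs)
```

Inside a regular Mix project, the deps entry above is still the right place for the dependency; Mix.install/1 is meant for scripts and IEx sessions.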

Now, let’s parse a simple HTML string to extract data. We’ll look for the titles inside <h1> tags:

html_content = """
<html>
  <body>
    <h1>Hello, Elixir!</h1>
    <h1>Another Title</h1>
  </body>
</html>
"""

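# Parse the string into a tree first, then query the tree.
{:ok, document} = Floki.parse_document(html_content)

titles =
  document
  |> Floki.find("h1")
  # Floki.text/1 on the whole list would return one joined string,
  # so map over the nodes to get one string per heading.
  |> Enum.map(&Floki.text/1)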

IO.inspect(titles)

Sample Output:

["Hello, Elixir!", "Another Title"]
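
One detail worth knowing: when Floki.text is given a list of nodes, it returns a single concatenated string rather than a list. The following sketch contrasts three ways of collecting the heading text. It assumes a script run with Mix.install/1, and the variable names are illustrative only.

```elixir
Mix.install([{:floki, "~> 0.31.0"}])

{:ok, document} =
  Floki.parse_document("<h1>Hello, Elixir!</h1><h1>Another Title</h1>")

headings = Floki.find(document, "h1")

# Called on the whole list, Floki.text/1 joins every text node into one string.
joined = Floki.text(headings)

# The :sep option places a separator between the text nodes.
separated = Floki.text(headings, sep: " | ")

# Mapping over the nodes yields one string per heading.
per_heading = Enum.map(headings, &Floki.text/1)

IO.inspect({joined, separated, per_heading})
```

Mapping is usually the better fit when each element's text needs to stay separate, for example when building a list of titles.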

To dive deeper, say you want to extract links (<a> tags) alongside their href attributes. Here’s how you can achieve that:

html_content = """
<html>
  <body>
    <a href="https://elixir-lang.org/">Elixir's Official Website</a>
    <a href="https://hexdocs.pm/">HexDocs</a>
  </body>
</html>
"""

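# Parse first; recent Floki versions deprecate passing a raw string to Floki.find/2.
{:ok, document} = Floki.parse_document(html_content)

# Each matched node is a {tag, attributes, children} tuple.
# The [text] pattern assumes every link has exactly one text child.
links =
  document
  |> Floki.find("a")
  |> Enum.map(fn {_tag, attrs, [text]} -> {text, List.keyfind(attrs, "href", 0)} end)
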
IO.inspect(links)

Sample Output:

[{"Elixir's Official Website", {"href", "https://elixir-lang.org/"}}, {"HexDocs", {"href", "https://hexdocs.pm/"}}]
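
The [text] pattern in the example above only matches links whose content is a single text node; a link containing nested markup such as <strong> would raise a FunctionClauseError. A more tolerant variant, sketched here under the assumption of a Mix.install/1 script and made-up sample HTML, uses Floki.attribute/2 for the URL and Floki.text/1 for the label.

```elixir
Mix.install([{:floki, "~> 0.31.0"}])

# Sample input (invented for this sketch): one link with nested markup
# and one anchor without an href.
html = """
<a href="https://elixir-lang.org/">Elixir's <strong>Official</strong> Website</a>
<a href="https://hexdocs.pm/">HexDocs</a>
<a>No href here</a>
"""

{:ok, document} = Floki.parse_document(html)

links =
  document
  # The attribute selector skips anchors that have no href.
  |> Floki.find("a[href]")
  |> Enum.map(fn link ->
    # Floki.attribute/2 returns a list of matching attribute values.
    [href] = Floki.attribute(link, "href")
    # Floki.text/1 collects text from nested children as well.
    {Floki.text(link), href}
  end)

IO.inspect(links)
```

This returns plain {label, url} tuples instead of the {"href", url} pairs produced by List.keyfind/3, which tends to be easier to consume downstream.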

The same two steps, selecting nodes with Floki.find/2 and then transforming the matches, cover most HTML extraction tasks in Elixir applications.

Last updated on March 13, 2024
