Floki is a simple HTML parser that enables search for nodes using CSS selectors.
Take this HTML as an example:
<!doctype html>
<html>
<body>
<section id="content">
<p class="headline">Floki</p>
<span class="headline">Enables search using CSS selectors</span>
<a href="https://github.com/philss/floki">Github page</a>
<span data-model="user">philss</span>
</section>
<a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>
Here are some queries that you can perform (with return examples):
Floki.find(html, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]
Floki.find(html, "p.headline")
|> Floki.raw_html
# => <p class="headline">Floki</p>
Each HTML node is represented by a tuple like:
{tag_name, attributes, children_nodes}
Example of node:
{"p", [{"class", "headline"}], ["Floki"]}
So even if the only child node is the element text, it is represented inside a list.
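Because nodes are plain tuples and strings, they can be traversed with ordinary pattern matching. The following is a minimal sketch under that assumption, not Floki's own implementation; the module name NodeSketch and its text/1 function are hypothetical and exist only for illustration.

```elixir
defmodule NodeSketch do
  # Hypothetical helper, not part of Floki: collects the text of a node
  # and of all its descendants, in document order.

  # A text node is already a string.
  def text(node) when is_binary(node), do: node

  # An element node: recurse into its children.
  def text({_tag, _attrs, children}), do: text(children)

  # A list of nodes: concatenate the text of each one.
  def text(nodes) when is_list(nodes), do: Enum.map_join(nodes, &text/1)

  # Anything else (for example a comment node) contributes no text.
  def text(_other), do: ""
end

node = {"p", [{"class", "headline"}], ["Floki"]}
IO.puts(NodeSketch.text(node))
# prints "Floki"
```

The list clause is what makes the "children are always a list" convention convenient: the same function handles a single node, its children, and a whole result set.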
You can write a simple HTML crawler with Floki and HTTPoison:
html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)
It is as simple as that!
Add Floki to your mix.exs:
defp deps do
[
{:floki, "~> 0.20.0"}
]
end
After that, run mix deps.get.
Floki needs the leex module in order to compile.
Normally this module is installed with Erlang in a complete installation.
If compilation fails because the leex module cannot be found, you need to install the erlang-dev and erlang-parsetools packages in order to get it.
The package names may differ depending on your OS.
You can configure Floki to use html5ever as your HTML parser.
This is recommended if you need better performance and a more accurate parser. However, html5ever is under active development and may be unstable.
Since it is written in Rust, you need to install Rust and compile the project. Luckily, the html5ever Elixir NIF makes the integration very easy.
To install Rust on your system, follow the instructions on the official page.
After setting up Rust, add the html5ever NIF to your dependency list:
defp deps do
[
{:floki, "~> 0.20.0"},
{:html5ever, "~> 0.6.1"}
]
end
Run mix deps.get and compile the project with mix compile to make sure it works.
Then configure your app to use html5ever:
# in config/config.exs
config :floki, :html_parser, Floki.HTMLParser.Html5ever
After that, you are able to use html5ever as your HTML parser with Floki.
For more info, check the article Rustler - Safe Erlang and Elixir NIFs in Rust.
To parse an HTML document, try:
html = """
<html>
<body>
<div class="example"></div>
</body>
</html>
"""
Floki.parse(html)
# => {"html", [], [{"body", [], [{"div", [{"class", "example"}], []}]}]}
To find elements with the class example, try:
Floki.find(html, ".example")
# => [{"div", [{"class", "example"}], []}]
To convert your node tree back to raw HTML (spaces are ignored):
Floki.find(html, ".example")
|> Floki.raw_html
# => <div class="example"></div>
To fetch some attribute from elements, try:
Floki.attribute(html, ".example", "class")
# => ["example"]
You can get attributes from elements that you already have:
Floki.find(html, ".example")
|> Floki.attribute("class")
# => ["example"]
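Since each node carries its attributes as a list of {name, value} pairs, the values can also be read straight from the tuples. This is a minimal sketch of that idea, not Floki's implementation; AttrSketch and values/2 are hypothetical names, and the sketch only looks at the nodes it is given, not at their descendants.

```elixir
defmodule AttrSketch do
  # Hypothetical helper, not part of Floki: returns the values of the
  # attribute `name` for every element node in `nodes` that has it.
  def values(nodes, name) when is_list(nodes) do
    # Entries that are not 3-tuples (such as text nodes) and attributes
    # with a different name are skipped by the generator patterns.
    for {_tag, attrs, _children} <- nodes,
        {^name, value} <- attrs,
        do: value
  end
end

nodes = [{"div", [{"class", "example"}], []}]
IO.inspect(AttrSketch.values(nodes, "class"))
# prints ["example"]
```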
If you want to get the text from an element (here using the HTML from the first example), try:
Floki.find(html, "p.headline")
|> Floki.text
# => "Floki"
The following CSS selectors are supported in the current version:
Pattern | Description |
---|---|
* | any element |
E | an element of type E |
E[foo] | an E element with a "foo" attribute |
E[foo="bar"] | an E element whose "foo" attribute value is exactly equal to "bar" |
E[foo~="bar"] | an E element whose "foo" attribute value is a list of whitespace-separated values, one of which is exactly equal to "bar" |
E[foo^="bar"] | an E element whose "foo" attribute value begins exactly with the string "bar" |
E[foo$="bar"] | an E element whose "foo" attribute value ends exactly with the string "bar" |
E[foo*="bar"] | an E element whose "foo" attribute value contains the substring "bar" |
E[foo\|="en"] | an E element whose "foo" attribute has a hyphen-separated list of values beginning (from the left) with "en" |
E:nth-child(n) | an E element, the n-th child of its parent |
E:first-child | an E element, first child of its parent |
E:last-child | an E element, last child of its parent |
E:nth-of-type(n) | an E element, the n-th child of its type among its siblings |
E:first-of-type | an E element, first child of its type among its siblings |
E:last-of-type | an E element, last child of its type among its siblings |
E.warning | an E element whose class is "warning" |
E#myid | an E element with ID equal to "myid" |
E:not(s) | an E element that does not match simple selector s |
E F | an F element descendant of an E element |
E > F | an F element child of an E element |
E + F | an F element immediately preceded by an E element |
E ~ F | an F element preceded by an E element |
There are also some selectors based on non-standard specifications. They are:
Pattern | Description |
---|---|
E:fl-contains('foo') | an E element that contains "foo" inside a text node |
Floki is released under the MIT license. Check the LICENSE file for more details.