Elixir web inspector to unfurl URLs.
The package can be installed by adding `web_inspector` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:web_inspector, git: "https://github.com/eteubert/web_inspector.git"}
  ]
end
```
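Then fetch the dependency with `mix deps.get`.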
Unfurl a URL:

```elixir
WebInspector.unfurl("https://podlove.org")

{:ok,
 %{
   description: nil,
   embed: nil,
   icon: %{
     height: "32",
     type: "icon",
     url: "https://podlove.org/files/2014/06/cropped-podlove-avatar-bkd-1024-32x32.png",
     width: "32"
   },
   locations: ["https://podlove.org"],
   original_url: "https://podlove.org",
   providers: %{
     misc: %{
       "canonical_url" => nil,
       "icons" => [
         %{
           height: "32",
           type: "icon",
           url: "https://podlove.org/files/2014/06/cropped-podlove-avatar-bkd-1024-32x32.png",
           width: "32"
         },
         %{
           height: "192",
           type: "icon",
           url: "https://podlove.org/files/2014/06/cropped-podlove-avatar-bkd-1024-192x192.png",
           width: "192"
         }
       ],
       "title" => "Podlove | Personal Media Development"
     },
     oembed: %{},
     open_graph: %{},
     twitter: %{}
   },
   site_name: "podlove.org",
   site_url: "https://podlove.org",
   title: "Podlove | Personal Media Development",
   url: "https://podlove.org"
 }}
```
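In application code you will typically pattern match on the result. A minimal sketch; the success shape matches the map above, while the `{:error, reason}` clause is an assumption and may differ from what the library actually returns:

```elixir
# Sketch: handle the unfurl result. The error clause is assumed, adjust as needed.
case WebInspector.unfurl("https://podlove.org") do
  {:ok, %{title: title, icon: icon}} ->
    # `icon` may be nil, so guard before reading its URL.
    IO.puts("#{title} (icon: #{icon && icon.url})")

  {:error, reason} ->
    IO.puts("unfurl failed: #{inspect(reason)}")
end
```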
You can set up your own Puppeteer server if you need full control over the page visit and the extracted data. It should return data in the following format:

```
{
  site_name,
  title,
  description,
  image,
  url
}
```
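For example, a response for podlove.org could look like this; the values are illustrative, taken from the extraction shown above:

```json
{
  "site_name": "podlove.org",
  "title": "Podlove | Personal Media Development",
  "description": null,
  "image": "https://podlove.org/files/2014/06/cropped-podlove-avatar-bkd-1024-192x192.png",
  "url": "https://podlove.org"
}
```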
The other extractors still run, but data extracted by the Puppeteer server takes precedence.
Enable Puppeteer and define its hostname in the config:

```elixir
config :web_inspector, puppeteer_enabled: true
config :web_inspector, puppeteer_host: "localhost:5000"
```
Instead of plain HTTP requests, there is a Scraping Ant adapter for more reliable results in case you run into bot detection issues. You need to put the API key into the environment variable `SCRAPINGANT_API_KEY`.

For example, create a `.env` file with the content:

```
SCRAPINGANT_API_KEY=abcde...789
```
It will be loaded automatically, or use your own way of setting environment variables.
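For instance, you can export the variable in your shell before starting the application:

```shell
export SCRAPINGANT_API_KEY=abcde...789
```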