Web page inspector for Elixir.
Funkspector is a web scraper that lets you extract data from web pages and XML sitemaps.
Simply pass Funkspector the URL of a web page to inspect and it will return its scraped data:
iex> { :ok, data } = Funkspector.page_scrape("http://github.com")
Funkspector can extract the locations from XML sitemaps, like this:
iex> { :ok, data } = Funkspector.sitemap_scrape("http://validationhell.com/sitemap.xml")
Both Funkspector.page_scrape and Funkspector.sitemap_scrape accept options to customize the timeout and User Agent string.
For example, you could use:
Funkspector.page_scrape("http://github.com", %{recv_timeout: 5_000, user_agent: "My Bot"})
Funkspector.sitemap_scrape("http://validationhell.com/sitemap.xml", %{recv_timeout: 5_000, user_agent: "My Bot"})
Funkspector currently returns the following scraped data for both pages and sitemaps:
- headers. Response headers, including content-type, etc.
- body. Raw body.
- original_url and final_url. Funkspector follows redirections, so these are the original URL given and the final one reached after following them.
- scheme. For example, "http" or "https".
- host. For example, "github.com".
- root_url. Root URL for the given URL. For http://example.com/about it will be http://example.com.
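As a rough illustration of how the scheme, host and root_url fields relate to a URL, the sketch below derives them with Elixir's standard URI module. The RootUrlSketch module and its parts/1 function are hypothetical names used only here. This is not how Funkspector computes these values internally, and the sketch ignores non-default ports.

```elixir
defmodule RootUrlSketch do
  # Hypothetical helper, not part of Funkspector: splits a URL into the
  # scheme, host and root URL using the standard library's URI module.
  def parts(url) do
    uri = URI.parse(url)

    %{
      scheme: uri.scheme,
      host: uri.host,
      # Assumption: the root URL is just scheme plus host (port is ignored).
      root_url: "#{uri.scheme}://#{uri.host}"
    }
  end
end

IO.inspect(RootUrlSketch.parts("http://example.com/about"))
```

For http://example.com/about this gives "http", "example.com" and http://example.com, matching the description above.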
The PageScraper also returns:
- links. Organized in raw, http.internal, http.external and non_http.
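The grouping of links can be explored with a small helper like the one below. It assumes the links value is a nested map with atom keys (raw, http.internal, http.external, non_http), which is inferred from the names above rather than confirmed. LinkSummarySketch and the sample data are hypothetical and stand in for a real scrape result.

```elixir
defmodule LinkSummarySketch do
  # Hypothetical helper: counts the links in each group of a `links` map.
  # Assumes the nested shape links.raw, links.http.internal,
  # links.http.external and links.non_http.
  def counts(links) do
    %{
      raw: length(links.raw),
      internal: length(links.http.internal),
      external: length(links.http.external),
      non_http: length(links.non_http)
    }
  end
end

# Hand-written sample data, not actual Funkspector output.
sample_links = %{
  raw: ["/about", "http://other.org/", "mailto:hi@example.com"],
  http: %{
    internal: ["http://example.com/about"],
    external: ["http://other.org/"]
  },
  non_http: ["mailto:hi@example.com"]
}

IO.inspect(LinkSummarySketch.counts(sample_links))
```

With real data, you would pass the links value from a successful page scrape instead of sample_links.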
The SitemapScraper also returns:
- locs. Collection of URLs.
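Once you have the locs collection, a common next step is narrowing it to one section of a site. The sketch below assumes locs is a plain list of URL strings. LocsSketch and under_path/2 are hypothetical names, and the list is hand-written sample data.

```elixir
defmodule LocsSketch do
  # Hypothetical helper: keeps only the URLs whose path starts with `prefix`.
  def under_path(locs, prefix) do
    Enum.filter(locs, fn loc ->
      # URI.parse/1 returns a nil path for URLs such as "http://example.com".
      path = URI.parse(loc).path || "/"
      String.starts_with?(path, prefix)
    end)
  end
end

# Sample data standing in for the locs of a scraped sitemap.
locs = [
  "http://example.com/",
  "http://example.com/blog/first-post",
  "http://example.com/about"
]

IO.inspect(LocsSketch.under_path(locs, "/blog"))
```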
In case of error, Funkspector will return the original_url and the reason from the server:
case Funkspector.page_scrape("http://example.com") do
  { :ok, data } ->
    IO.inspect(data)
  { :error, url, reason } ->
    # inspect/1 keeps this safe when the reason is not a plain string or atom
    IO.puts "Could not scrape #{url} because of #{inspect(reason)}"
end
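When scraping several URLs, the two return shapes above can be collected separately. The following is a minimal sketch: BatchScrapeSketch is a hypothetical module, and it takes the scrape function as an argument so the example runs without network access. In real use you might pass &Funkspector.page_scrape/1 instead of the stub.

```elixir
defmodule BatchScrapeSketch do
  # Hypothetical helper: applies `scrape_fun` to each URL and separates
  # {:ok, data} results from {:error, url, reason} results.
  def run(urls, scrape_fun) do
    results = Enum.map(urls, scrape_fun)
    {oks, errors} = Enum.split_with(results, &match?({:ok, _}, &1))

    %{
      ok: Enum.map(oks, fn {:ok, data} -> data end),
      errors: Enum.map(errors, fn {:error, url, reason} -> {url, reason} end)
    }
  end
end

# Stub standing in for a real scrape function; the error reason is made up.
stub = fn
  "http://bad.example" = url -> {:error, url, :nxdomain}
  url -> {:ok, %{original_url: url}}
end

IO.inspect(BatchScrapeSketch.run(["http://example.com", "http://bad.example"], stub))
```

Passing the scrape function in keeps the helper easy to test and leaves the choice between page and sitemap scraping to the caller.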
If available in Hex, the package can be installed as:
- Add funkspector to your list of dependencies in mix.exs:

    def deps do
      [{:funkspector, "~> 0.1"}]
    end

- Ensure funkspector is started before your application:

    def application do
      [applications: [:funkspector]]
    end