Web Scraper
Web Scraper makes it effortless to scrape websites. Provide a URL and a CSS selector, and it returns JSON containing the text contents of the matching elements. You can also scrape HTML attribute values by optionally specifying an attribute name.
Examples
Heading from example.com
web.scraper.workers.dev/?url=example.com&selector=h1
{"result":["Example Domain"]}
Profile details from github.com profile page
{
  "result": {
    ".vcard-fullname": [
      "Adam Schwartz"
    ],
    ".d-md-block [itemprop=worksFor]": [
      "@cloudflare"
    ],
    ".d-md-block [itemprop=homeLocation]": [
      "Boston, MA"
    ]
  }
}
Random quote/author from quotes.net
web.scraper.workers.dev/?url=quotes.net/random.php&selector=%23disp-quote-body,.author&pretty=true
{
  "result": {
    "#disp-quote-body": [
      "We are advertis'd by our loving friends."
    ],
    ".author": [
      "William Shakespeare"
    ]
  }
}
API
- Requests are made as GET against https://web.scraper.workers.dev.
- There are two required query params, url and selector.
- There are three optional query params, attr, pretty and spaced.
https://web.scraper.workers.dev
?url=https://example.com
&selector=p
&attr=title
&pretty=true
&spaced=true
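As a rough sketch, a request URL like the one above could be assembled in JavaScript with the standard URL API, which also takes care of percent-encoding. The helper name buildScraperUrl and its options object are assumptions made for illustration; only the endpoint and the five query params come from this README.

```javascript
// Hypothetical helper (not part of Web Scraper): builds a request URL
// from the query params described in this README.
const ENDPOINT = "https://web.scraper.workers.dev/";

function buildScraperUrl({ url, selector, attr, pretty, spaced }) {
  if (!url || !selector) {
    throw new Error("url and selector are required");
  }
  const requestUrl = new URL(ENDPOINT);
  requestUrl.searchParams.set("url", url);
  requestUrl.searchParams.set("selector", selector);
  // Optional params are only added when the caller supplies them.
  if (attr) requestUrl.searchParams.set("attr", attr);
  if (pretty) requestUrl.searchParams.set("pretty", "true");
  if (spaced) requestUrl.searchParams.set("spaced", "true");
  return requestUrl.toString();
}

console.log(
  buildScraperUrl({
    url: "https://example.com",
    selector: "p",
    attr: "title",
    pretty: true,
    spaced: true,
  })
);
```

The printed URL has its values percent-encoded, so it will not look character-for-character like the hand-written form above, but it decodes to the same params.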
How it works
If at least url and selector are set, the response value will always be JSON.
If only one node is found on the page matching the selector, the result will be a string. If more than one node is found, the result will be an array of strings.
If an attr is provided, the result will be a string taken from the first matching node that has a non-empty value for that HTML attribute.
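Because the result can be a string, an array of strings, or (as in the multi-selector examples earlier) an object keyed by selector, client code may want to normalize it before use. The following is a minimal sketch under that reading of the README; normalizeResult and toArray are hypothetical names, not something Web Scraper provides.

```javascript
// Hypothetical client-side helpers: coerce a "result" value into arrays of strings.
function toArray(value) {
  return Array.isArray(value) ? value : [value];
}

function normalizeResult(result) {
  if (typeof result === "string" || Array.isArray(result)) {
    return toArray(result);
  }
  // Assumed shape for multi-selector requests: an object keyed by selector.
  const normalized = {};
  for (const [selector, value] of Object.entries(result)) {
    normalized[selector] = toArray(value);
  }
  return normalized;
}

console.log(normalizeResult("Example Domain"));
console.log(normalizeResult({ ".author": "William Shakespeare" }));
```

With this in place, callers can always iterate over arrays regardless of how many nodes matched.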
Query params
url (required)
- Supports https:// and http:// protocols.
- If a protocol isn’t found, http:// is prepended.
  - e.g. https://web.scraper.workers.dev/?url=example.com&selector=p
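The protocol rule above could be modeled roughly as follows. This is an illustrative sketch of the documented behavior, not the actual Worker source; the function name withDefaultProtocol is an assumption.

```javascript
// Illustrative only: prepends http:// when no http(s) protocol is present.
function withDefaultProtocol(url) {
  return /^https?:\/\//i.test(url) ? url : `http://${url}`;
}

console.log(withDefaultProtocol("example.com"));
console.log(withDefaultProtocol("https://example.com"));
```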
selector (required)
- Supports the same set of CSS selectors as Cloudflare Workers' HTMLRewriter class.
  - As of Oct 10, 2019, this includes:
    - * – any element
    - E – any element of type E
    - E:not(s) – an E element that does not match the compound selector s
    - E.warning – an E element belonging to the class warning
    - E#myid – an E element with ID equal to myid
    - E[foo] – an E element with a foo attribute
    - E[foo="bar"] – an E element whose foo attribute value is exactly equal to bar
    - E[foo="bar" i] – an E element whose foo attribute value is exactly equal to any (ASCII-range) case-permutation of bar
    - E[foo="bar" s] – an E element whose foo attribute value is exactly and case-sensitively equal to bar
    - E[foo~="bar"] – an E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar
    - E[foo^="bar"] – an E element whose foo attribute value begins exactly with the string bar
    - E[foo$="bar"] – an E element whose foo attribute value ends exactly with the string bar
    - E[foo*="bar"] – an E element whose foo attribute value contains the substring bar
    - E[foo|="en"] – an E element whose foo attribute value is a hyphen-separated list of values beginning with en
    - E F – an F element descendant of an E element
    - E > F – an F element child of an E element
- Supports multiple selectors delimited with a comma.
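Selectors often contain characters that are special in URLs. In particular, an unencoded # starts the URL fragment, so an ID selector has to be percent-encoded (hence the %23 in the quotes.net example). A small sketch using the standard encodeURIComponent function, with variable names chosen for illustration:

```javascript
// Two selectors, comma-delimited, as in the quotes.net example.
const selector = "#disp-quote-body,.author";

// encodeURIComponent turns "#" into "%23" (and "," into "%2C", which decodes back to a comma).
const encoded = encodeURIComponent(selector);

const requestUrl =
  "https://web.scraper.workers.dev/?url=quotes.net/random.php&selector=" + encoded;

console.log(requestUrl);
```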
attr (optional)
- When attr is not set, the text contents of all matched nodes are returned.
- When attr is set, that HTML attribute is scraped from the first matching node with a non-empty value for that attribute.
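The "first matching node with a non-empty value" rule can be pictured with a toy model in which each matched node is a plain object of attribute names to values. This is only an illustration of the rule as described, not how the Worker is implemented; firstNonEmptyAttr is a hypothetical name, and returning undefined when nothing qualifies is an assumption.

```javascript
// Toy model: nodes are plain objects mapping attribute names to string values.
function firstNonEmptyAttr(nodes, attr) {
  const match = nodes.find(
    (node) => typeof node[attr] === "string" && node[attr] !== ""
  );
  // Assumption: undefined when no node has a non-empty value for attr.
  return match ? match[attr] : undefined;
}

const matched = [{}, { title: "" }, { title: "First" }, { title: "Second" }];
console.log(firstNonEmptyAttr(matched, "title"));
```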
pretty (optional)
- When false or not included, JSON is minified.
- When true, JSON is formatted using JSON.stringify(json, null, 2).
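The difference between the two modes can be seen by calling JSON.stringify directly. The object below reuses the example.com response from earlier in this README; the variable names are illustrative.

```javascript
const json = { result: ["Example Domain"] };

// Default: no whitespace between tokens.
const minified = JSON.stringify(json);

// pretty=true: the third argument indents nested values by 2 spaces.
const pretty = JSON.stringify(json, null, 2);

console.log(minified);
console.log(pretty);
```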
spaced (optional)
- When false or not included, the text nodes of children of the nodes matching selector will be concatenated raw.
- When true, a single space character is added after the end tag of each child node found.
Examples
Consider the following DOM structure:
<div><p>This is the first paragraph.</p><p>This is another paragraph.</p></div>
If the selector is set to match div, by default the resulting text will be:
This is the first paragraph.This is another paragraph.
This is because there is no space character between </p> and <p>.
With spaced set to true, the result is:
This is the first paragraph. This is another paragraph.
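The two outputs above can be reproduced with a toy model that treats the div's children as a list of text strings. This is a sketch of the described behavior only, not the Worker's HTMLRewriter code; collectText is a hypothetical name, and trimming the final trailing space is an assumption made so the output matches the example.

```javascript
// Toy model: childTexts holds the text of each child node, in document order.
function collectText(childTexts, spaced) {
  if (!spaced) {
    // Raw concatenation: nothing is inserted between children.
    return childTexts.join("");
  }
  // A space follows each child's text; the trailing one is trimmed (assumption).
  return childTexts.map((text) => text + " ").join("").trim();
}

const paragraphs = ["This is the first paragraph.", "This is another paragraph."];
console.log(collectText(paragraphs, false));
console.log(collectText(paragraphs, true));
```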
Development
Web Scraper is powered by Cloudflare Workers, heavily utilizing HTMLRewriter for parsing the HTML and scraping the text.
To develop Web Scraper locally, clone the repo and follow these steps:
- Install the Workers CLI globally:
npm i @cloudflare/wrangler@1.4.0-rc.7 -g
- Run the preview/watcher inside the repo:
wrangler preview --watch
This will open up the Workers preview experience, so you can test and debug the site. The main source can be found in index.js. As you make changes you’ll see them live in the previewer.
Deploying
Web Scraper is deployed automatically when changes are pushed to master using a GitHub Action and the Workers CLI.
Author
Web Scraper was created by Adam Schwartz.