GraphQL lets us query all sorts of graph-shaped data - so why not use it to query the world's most useful graph, the web?
graphql-scraper is a command-line tool and reusable GraphQL schema which lets you easily extract data from HTML.
Check out a live demo here. You can easily spin up your own by using graphql-scraper-server.
npx graphql-scraper <query-file>
or
npm install -g graphql-scraper
graphql-scraper <query-file>
Reads a GraphQL query from the path query-file, and prints the result.
If query-file is not given, reads the query from stdin.
--json
Returns the result in JSON format, for use in other tools.
--help
Prints a help string.
Any other named option you pass to the CLI is used as a query variable.
For example, if you want to reuse the same query on several pages, you could write the following query file (query.graphql):
query ExampleQueryWithVariable($page: String) {
  page(url: $page) {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}
...and execute the query like this:
graphql-scraper query.graphql --page="https://news.ycombinator.com/"
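To make the mapping from options to variables concrete, here is a minimal sketch of how `--name=value` options could be collected into a variables object. The helper `optionsToVariables` and the `RESERVED` set are hypothetical names used for illustration only; this is not the parser graphql-scraper itself uses, and it only handles the `--name=value` form shown above.

```javascript
// Hypothetical helper, for illustration only: NOT graphql-scraper's own parser.
// Options the CLI documents for itself, which should not become query variables.
const RESERVED = new Set(['json', 'help'])

function optionsToVariables(argv) {
  const variables = {}
  for (const arg of argv) {
    // Only the --name=value form is recognised in this sketch.
    const match = /^--([^=]+)=(.*)$/.exec(arg)
    if (!match) continue // positional arguments and bare flags are skipped
    const name = match[1]
    const value = match[2]
    if (RESERVED.has(name)) continue
    variables[name] = value
  }
  return variables
}

const variables = optionsToVariables([
  'query.graphql',
  '--page=https://news.ycombinator.com/',
])
console.log(variables)
```

An object like this corresponds to the `$page` variable declared in the query file: each option name matches a variable name without the leading `$`.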
You can check out an auto-generated schema description here, but I recommend trying out the graphql-scraper-server example and exploring the types interactively. You can also play around with the schema in the live demo.
The npm package exports the GraphQL schema which is used by the command-line tool. This is an instance of graphql-js GraphQLSchema, which you can use anywhere that expects a schema, for example apollo-server or graphql-yoga.
Use npm install graphql-scraper or yarn add graphql-scraper to add the schema to your project.
import { graphql } from 'graphql'
import schema from 'graphql-scraper'
// You can also import it as follows:
// const schema = require('graphql-scraper')

const query = `
  {
    page(url: "http://news.ycombinator.com") {
      items: queryAll(selector: "tr.athing") {
        rank: text(selector: "td span.rank")
        title: text(selector: "td.title a")
        sitebit: text(selector: "span.comhead a")
        url: attr(selector: "td.title a", name: "href")
        attrs: next {
          score: text(selector: "span.score")
          user: text(selector: "a:first-of-type")
          comments: text(selector: "a:nth-of-type(3)")
        }
      }
    }
  }
`

graphql(schema, query).then(response => {
  console.log(response)
})
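The response follows the usual GraphQL result shape, with the aliases from the query as keys under `data`. As a small sketch of what you might do with it, the hypothetical helper `toRows` below merges each item's nested `attrs` into the item itself. The sample object is made-up data in the shape this query would produce, not real scraper output.

```javascript
// Hypothetical helper: flattens the `items` list into one plain object per story.
function toRows(response) {
  // A GraphQL result may carry an `errors` array alongside (or instead of) `data`.
  if (response.errors && response.errors.length > 0) {
    throw new Error(response.errors.map(e => e.message).join('; '))
  }
  const page = response.data && response.data.page
  const items = (page && page.items) || []
  // Merge the nested `attrs` object (which may be null) into its parent item.
  return items.map(({ attrs, ...item }) => ({ ...item, ...(attrs || {}) }))
}

// Made-up sample data, for illustration only.
const sample = {
  data: {
    page: {
      items: [
        {
          rank: '1.',
          title: 'Example story',
          sitebit: 'example.com',
          url: 'https://example.com/story',
          attrs: { score: '42 points', user: 'someone', comments: '7 comments' },
        },
      ],
    },
  },
}

console.log(toRows(sample))
```

In the example above, you could call `toRows(response)` inside the `.then` callback instead of logging the raw response.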
This project was inspired by gdom, which is written in Python and uses the Graphene GraphQL library.
If you want to switch over from gdom, please note some schema changes:
query(selector: String!) now only returns a single Element, rather than a list (like document.querySelector). Added a new queryAll(selector: String!): [Element] field, which behaves like document.querySelectorAll.
is(selector: String!) is renamed to has(selector: String!).
children, parent, siblings, next etc. no longer have a selector argument. If you need to select children with a specific selector, use child selectors (.foo > .bar).
parents is removed.
prev[All] is renamed to previous[All].
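The purely mechanical renames above can be sketched as a small rewrite step. The helper `migrateGdomQuery` is hypothetical and not part of graphql-scraper. It is a rough, regex-based sketch that assumes the old list-returning `query` should become `queryAll`, and it does not understand GraphQL comments or block strings. Removed `selector` arguments on `children`, `parent`, `siblings`, `next` etc., and uses of `parents`, still need to be fixed by hand.

```javascript
// Hypothetical helper: applies only the mechanical field renames from gdom.
function migrateGdomQuery(source) {
  // Split on double-quoted string literals so that selectors are left untouched.
  // With one capture group, odd indices in the result are the string literals.
  const parts = source.split(/("(?:[^"\\]|\\.)*")/)
  return parts
    .map((part, index) => {
      if (index % 2 === 1) return part // a string literal: keep as-is
      return part
        // The old list-returning `query(selector: ...)` field becomes `queryAll`.
        .replace(/\bquery(?=\s*\(\s*selector\b)/g, 'queryAll')
        .replace(/\bis(?=\s*\(\s*selector\b)/g, 'has')
        .replace(/\bprevAll\b/g, 'previousAll')
        .replace(/\bprev\b/g, 'previous')
    })
    .join('')
}

const oldQuery = `{
  page(url: "http://news.ycombinator.com") {
    items: query(selector: "tr.athing") {
      title: text(selector: "td.title a")
    }
  }
}`

console.log(migrateGdomQuery(oldQuery))
```

Because the operation keyword `query` is never directly followed by a `selector` argument, named operations such as `query Example { ... }` are left alone.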
PRs accepted.
MIT © 2018 harry lachenmayer