Warden
Continuous Web Scraping Framework
Framework for easily creating web scrapers that continuously check for new results and notify user on changes.
For example, you're interested in news from lobste.rs and
would like to receive daily email newsletter. You could use lobsters
as input
and sendgrid
as output to create a job like this:
{
id: 'lobsters-daily',
name: 'Lobste.rs Daily',
scheduleAt: '0 12 * * *',
inputs: [lobsters()],
outputs: [
sendgrid({
apiKey: 'sekret',
sender: 'warden@example.com',
recipients: ['me@example.com'],
}),
],
}
This job is scheduled to be run at noon (0 12 * * *
is cron
syntax) and
will send an email to me@example.com
with latest news from
lobste.rs.
It's possible to mix and match inputs/outputs in various ways. See
src/inputs/
for
available inputs, see
src/outputs/
for
available outputs.
It's also possible to quickly make a new input or output with TypeScript.
Here's a more complicated example that scrapes ss.com for Audi, BMW and Mercedes with 3.0+ liter gasoline engine, manual transmission and price starting from 10k EUR. Then it notifies user by printing out to console and sending an email. It's scheduled to be run on every hour if it's 9-17 and weekday.
{
id: 'ss-audi-bmw-mercedes',
name: 'SS Audi, BMW & Mercedes',
scheduleAt: '0 9-17 * * 1-5',
inputs: ['audi', 'bmw', 'mercedes'].map(model =>
ss({
section: `transport/cars/${model}`,
filters: {
engineSizeLitersMin: '3.0',
fuelType: FuelType.Gasoline,
transmission: Transmission.Manual,
priceMin: 10000,
},
}),
),
outputs: [
terminal(),
sendgrid({
apiKey: 'sekret',
sender: 'warden@example.com',
recipients: ['me@example.com'],
}),
],
}