Currently working on https://masqueradejs.com to replace this project, as it is now quite out of date. In the meantime, you can check out https://github.com/clouedoc/puppeteer-boiler, which is similar and actively updated.
Foundation is intended as a simple entry-point / template for developers new to designing Puppeteer bots.
It uses the (in)famous Puppeteer-Extra package as the primary Puppeteer driver to enable its library of Stealth plugins and evasions.
PS: If you're working on botting and looking for a great developer community, check out the Puppeteer-Extra Discord server: https://discord.gg/vz7PeKk
Foundation tries to avoid wrapping existing libraries and does not "add" much that doesn't already exist, but starting a new project with an unfamiliar library can raise a lot of questions around project structure and tooling.
This attempts to answer those questions with ready-to-go scaffolding. Note, however, that the structure is just, like, my opinion man... and should be considered under heavy flux.
Breaking changes shouldn't matter, though, because it's only intended as a starting point and you should take it in whatever direction makes sense.
If you're new to both modern JavaScript (ES6 & TypeScript) and Puppeteer, here's a quick rundown:
Newbie Guide To Scraping With Puppeteer
Note for Windows users: This project does not include cross-env, so using WSL and Terminal Preview is essentially a requirement.
$ git clone https://github.com/prescience-data/foundation.git && cd ./foundation # Clone the project
$ npm run init
The automatic version runs the following commands:
$ git clone https://github.com/prescience-data/foundation.git && cd ./foundation # Clone the project
$ npm run update # Updates the package.json file dependencies to latest versions
$ npm install --loglevel=error # Installs dependencies
$ npm run db:init # Initialises a sqlite database
$ npm run build:clean # Build the TypeScript code
Edit the .env to your liking and add any services like Google Cloud Logging etc.
Remember to .gitignore and git rm --cached your .env file before committing to any public repositories.
The project is TypeScript so there are a few commands provided for this.
$ npm run build:clean # Just build the TypeScript files
or...
$ npm run bot # Builds the app and runs your entrypoint file
The project is split into two distinct parts, core and app.
This allows you to develop a quasi-framework in the Core concern that you can re-use between projects, while keeping all project-specific code within the App concern.
core/config.ts
.env
The project uses a .env file in the root to define most of the common environment variables, but you can load these from a database etc. if you prefer.
The main Puppeteer LaunchOptions are defined in the config.ts file.
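As a rough illustration of how values from .env might feed the launch options, here is a minimal sketch. The variable names (HEADLESS, CHROME_PATH, PROXY_SERVER), the LaunchConfig interface, and the buildLaunchConfig helper are all assumptions made for this example and are not taken from the project's config.ts. The returned field names (headless, executablePath, args) do match Puppeteer's launch options.

```typescript
// Hypothetical subset of Puppeteer's launch options used by this sketch.
interface LaunchConfig {
  headless: boolean
  executablePath: string | undefined
  args: string[]
}

// Build launch options from an env-like object (for example process.env).
// The variable names read here are illustrative assumptions.
const buildLaunchConfig = (
  env: Record<string, string | undefined>
): LaunchConfig => {
  const args: string[] = ["--no-first-run"]
  if (env.PROXY_SERVER) {
    // Chromium accepts the proxy as a command line switch.
    args.push(`--proxy-server=${env.PROXY_SERVER}`)
  }
  return {
    // Default to headless unless explicitly disabled.
    headless: env.HEADLESS !== "false",
    // Leave undefined to let Puppeteer use its bundled browser.
    executablePath: env.CHROME_PATH || undefined,
    args,
  }
}
```

Keeping this as a pure function of the environment makes it easy to swap .env for a database or any other source later.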
app/bot.ts
Main self-executing function entry-point.
This is where you cleanly execute each part of your scoped logic from the modules section.
Make some magic happen...
You call this module from the cli with:
$ npm run bot
You may wish to add cli arguments to direct the code in specific directions:
$ npm run bot -- --command=<CommandName>
Or if you prefer to shortcut your cli further you can add to your package.json scripts:
{
  "scripts": {
    "bot:moon-prism-power": "npm run bot -- --command=moon-prism-power"
  }
}
$ npm run bot:moon-prism-power
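To show how a `--command=<CommandName>` argument could steer the entry-point, here is a minimal sketch. The getFlag and runCommand helpers and the handlers in the map are hypothetical names invented for this example; the project may parse its arguments differently.

```typescript
type Handler = () => Promise<string>

// Extract the value of a `--name=value` style flag from an argument list.
const getFlag = (argv: string[], name: string): string | undefined => {
  const prefix = `--${name}=`
  const match = argv.find((arg) => arg.startsWith(prefix))
  return match ? match.slice(prefix.length) : undefined
}

// Hypothetical command handlers; real ones would call into your modules.
const commands = new Map<string, Handler>([
  ["moon-prism-power", async () => "transformed"],
  ["default", async () => "ran default flow"],
])

// Pick a handler based on `--command`, falling back to the default one.
const runCommand = async (argv: string[]): Promise<string> => {
  const name = getFlag(argv, "command") ?? "default"
  const handler = commands.get(name)
  if (!handler) {
    throw new Error(`Unknown command: ${name}`)
  }
  return handler()
}

// Possible usage inside bot.ts:
//   runCommand(process.argv.slice(2))
```

Failing loudly on an unknown command avoids silently running the default flow after a typo.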
app/modules/<name>.ts
Your bot logic should be defined in clear logical scopes within the app/modules folder. It's best to keep things neat and abstracted from the start to avoid huge, confusing, single-file blobs as your bot grows.
It might seem like overkill to abstract logic out at the start (which may be true for very simple bots), but you'll notice very quickly how bloated a modestly complete bot can get.
core/tests/<name>.ts
A large part of building your bot is rapidly testing it against known detection code.
Long-term, you'll want to develop your own internal tests by de-obfuscating the vendor code of your target. For rapid early development, however, using hosted ones is fine.
You can use the existing detection tests provided, or build your own from the basic template.
export const PixelScan: PageLogic = async (page: Page): Promise<Record<string, any>> => {
  // Load the test page.
  await page.goto("https://pixelscan.net", { waitUntil: "networkidle2" })
  await page.waitForTimeout(1500)
  // Extract the result element text.
  const element = await page.$("#consistency h1")
  if (!element) {
    throw new ElementNotFoundError(`Heading Tag`, element)
  }
  // textContent can be null, so fall back to an empty string.
  const text = await page.evaluate((el) => el.textContent, element)
  const result = (text ?? "").replace(/\s/g, " ").trim()
  // Notify and return result.
  return { result }
}
If you add new tests, remember to add them to the index.ts index so you can import all tests together if needed, and to the main run.ts file to allow cli access.
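The template relies on a PageLogic type and an ElementNotFoundError class without showing them. The sketch below is one guess at what such definitions could look like, not the project's actual code. PageLike is a hypothetical stand-in for Puppeteer's Page so the example runs without a browser, and CurrentUrl is an invented test.

```typescript
// Minimal stand-in for the parts of Puppeteer's Page this sketch needs.
interface PageLike {
  url(): string
}

// Assumed shape: a test receives a page and resolves to a result record.
type PageLogic<P = PageLike> = (page: P) => Promise<Record<string, unknown>>

// Assumed error type for a selector that matched nothing.
class ElementNotFoundError extends Error {
  readonly element: unknown

  constructor(label: string, element: unknown = null) {
    super(`Element not found: ${label}`)
    this.name = "ElementNotFoundError"
    this.element = element
  }
}

// Example test built on those types; it only reports the current url.
const CurrentUrl: PageLogic = async (page) => ({ result: page.url() })
```

Giving every test the same signature is what lets an index file collect them and a runner pick one by name.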
To run your tests, use the command:
$ npm run tests -- --page=sannysoft
- DataDome
npm run tests -- --page=datadome
- FingerprintJS Pro
npm run tests -- --page=fingerprintjs
- AreYouHeadless
npm run tests -- --page=headless
- PixelScan
npm run tests -- --page=pixelscan
- ReCAPTCHA
npm run tests -- --page=recaptcha
- SannySoft
npm run tests -- --page=sannysoft
core/utils.ts
Aim to keep all your small, highly re-used utility functions in a single place.
- rand(min: number, max: number, precision?: number) Returns a random number from a range.
- delay(min: number, max: number) Shortcuts the rand method to return an options-ready object.
- whitespace(value: string) Strips all duplicate whitespace and trims the string.
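For readers who want a feel for these helpers, here is a minimal sketch written from the descriptions above. The bodies are assumptions and may differ from the project's utils.ts, in particular the rounding behaviour of rand and the exact shape returned by delay.

```typescript
// Return a random number in [min, max], rounded to `precision` decimals.
const rand = (min: number, max: number, precision: number = 0): number => {
  const factor = 10 ** precision
  const value = Math.random() * (max - min) + min
  return Math.round(value * factor) / factor
}

// Build an options-ready object such as `{ delay: 87 }`, the shape accepted
// by Puppeteer calls like `page.type(selector, text, { delay })`.
const delay = (min: number, max: number): { delay: number } => ({
  delay: rand(min, max),
})

// Collapse runs of whitespace into single spaces and trim the ends.
const whitespace = (value: string): string =>
  value.replace(/\s+/g, " ").trim()
```

With helpers like these, a call such as `page.type("#query", "hello", delay(50, 120))` gets a different per-keystroke delay on every run, where "#query" is just a placeholder selector.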
core/browsers/<browser>.ts
All regular browsers are auto-loaded with the Stealth plugin.
- Chrome Using executable path. https://www.google.com/intl/en_au/chrome/
- Brave Using executable path. https://brave.com/
- Edge Using executable path. (Not available on Linux hosts) https://www.microsoft.com/en-us/edge
- Browserless https://docs.browserless.io/
- MultiLogin http://docs.multilogin.com/l/en/article/tkhr0ky2s6-puppeteer-browser-automation
- Incogniton https://incogniton.com/knowledge%20center/selenium-browser-automation
// Using Chrome via the executable.
import Chrome from "../core/browsers"
const browser: Browser = await Chrome()
const page: Page = await browser.newPage()
// Using MultiLogin with a profile id.
import MultiLogin from "../core/browsers"
const browser: Browser = await MultiLogin({ profileId: "fa3347ae-da62-4013-bcca-ef30825c9311"})
const page: Page = await browser.newPage()
// Using Browserless with an api token.
import Browserless from "../core/browsers"
const browser: Browser = await Browserless(env.BROWSERLESS_TOKEN)
const page: Page = await browser.newPage()
storage/profiles/<uuid>
Local storage folder for switching Chrome profiles.
core/services/db.ts
prisma/schema.prisma
Uses the fantastic Prisma database abstraction library with a simple sqlite database, but this can easily be configured for any local or remote RDBMS or keystore database.
$ npm run db:init # Wipes the database and regenerates types and migrations
$ npm run db:migrate # Creates migrations
$ npm run db:migrate:refresh # Long version of init
$ npm run db:generate # Generates fresh prisma files
import { db } from "../core/services"
;(async () => {
  // Bot execution code...
  // If a result was returned, store it in the database.
  if (result) {
    // Prisma queries only run once they are awaited.
    await db.scrape.create({
      data: {
        url: "https://www.startpage.com/en/privacy-policy/",
        html: result,
      },
    })
  }
})()
Additionally, you can build out shortcut methods in the database folder to DRY out common database transactions.
/**
 * Basic Prisma abstraction for a common task.
 *
 * @param {string} url
 * @param {string | Record<string, any>} data
 * @return {Promise<void>}
 */
export const storeScrape = async (
  url: string,
  data: string | Record<string, any>
): Promise<void> => {
  // Flatten any objects passed in.
  if (typeof data !== "string") {
    data = JSON.stringify(data)
  }
  // Store the data. Prisma queries only run once they are awaited.
  await db.scrape.create({
    data: {
      url: url,
      data: data,
    },
  })
}
core/services/logger.ts
Uses Winston to handle logging and output. Can be configured to transport to console, file, or a third-party transport like Google Cloud Logging (provided).
Check the docs here to extend or configure transports / switch out completely.
- Winston https://github.com/winstonjs/winston
- Google Cloud Logging https://cloud.google.com/logging/docs
- Bugsnag https://docs.bugsnag.com/platforms/javascript/
To set up Google Cloud Logging, you'll need a service account with Logs Writer and Monitoring Metric Writer permissions.
Guide:
- Create a GCP project https://console.cloud.google.com
- Enable the Cloud Logging API
- Create a service account
  - required roles:
    - Logging > Logs Writer
    - Monitoring > Monitoring Metric Writer
- Add a JSON key to the service account and download it to resources/google
- Make sure to edit the .env to match your service account key's filename! (GOOGLE_LOGGING_KEYFILE property)
The project comes preconfigured with the following tooling to keep your code neat and readable. Make sure to configure your IDE to pick up the configs.
- Prettier
- ESLint
Any contributions on this would be much appreciated!
- Writing Mocha tests
- More demos!
- Define other database systems eg Firebase
- Containerize with Docker
- Write mouse movement recorder and database storage driver
- Add ghost-cursor to demo
- Apply optional world isolation
- Add emojis to logger
- Migrate css selectors to xpath