System requirements
- Node.js 18+
- Windows 10+, Windows Server 2016+ or Windows Subsystem for Linux (WSL).
- macOS 13 Ventura, or macOS 14 Sonoma.
- Debian 11, Debian 12, Ubuntu 20.04 or Ubuntu 22.04, Ubuntu 24.04, on x86-64 and arm64 architecture.
- Clone the repo
git clone https://github.com/ericrav/playwright-web-scraping.git
cd playwright-web-scraping
- Install Node and dependencies
You must have Node.js version >= 18 to run Playwright and this repo.
Check your node version with node -v in the terminal.
If you don't have the right version, try installing nvm (Node Version Manager) to install and use the latest version.
Then, install the dependencies:
npm install
Then, install browser binaries used by Playwright with:
npx playwright install --with-deps
More info: https://playwright.dev/docs/browsers
- Run example scraper
npm run scrape
You should see a Chromium browser window open and close automatically, a snippet of text printed in your terminal, and an image saved to the ./screenshots directory. Now you're ready to scrape the web!
To run your workshop scraper, run in the terminal:
node scrapers/00-workshop.js
Experiment with Playwright functions in an interactive Node.js repl terminal (repl = "read evaluate print loop"). This will open the browser instance and Playwright inspector to show you what's happening.
npm run repl
Note: The repl script is ran with the Playwright debugger, so make sure to click the "Resume" button in the browser inspector so your commands can actually execute.
Use the PWDEBUG=inspector environment variable to debug your script with the Playwright inspector:
PWDEBUG=inspector node scrapers/00-workshop.js
Add await page.pause(); in your script to pause as a specific step.
npx playwright codegen nyu.edu