A script that scrapes product data from the Amazon Basics store page and collects item information. This project serves as an exercise to demonstrate web scraping techniques using Puppeteer.js.
>> skip down to demo and results
Screen.Recording.2023-04-11.at.2.55.57.PM.mp4
Instructions to get a copy of the project up and running on your local machine for development and testing purposes.
- Puppeteer.js
The project requires Node.js and npm to be installed.
To install dependencies, run the following command:
npm install
To run the script, use the following command:
npm start
The script was configured with the following options:
headless: false
- determines whether to run the browser in headless mode. Set to false to display the browser's user interface.
userDataDir: './tmp'
- a temporary directory created to store user data for the browser instance.
To modify these options, edit the puppeteer.launch() method in index.js.
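As a minimal sketch of how these two options could be assembled before being passed to puppeteer.launch(), the snippet below uses a hypothetical helper, buildLaunchOptions, which is not part of the project. It merges any overrides onto the defaults listed above. The commented usage lines assume Puppeteer is installed.

```javascript
// Hypothetical helper (not in the project): builds the options object
// that would be passed to puppeteer.launch().
function buildLaunchOptions(overrides = {}) {
  const defaults = {
    headless: false,      // show the browser's user interface
    userDataDir: './tmp', // directory for the browser instance's user data
  };
  // Later keys win, so any override replaces the matching default.
  return { ...defaults, ...overrides };
}

// Usage, assuming puppeteer is installed:
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(buildLaunchOptions({ headless: true }));
```

Keeping the defaults in one place makes it easier to switch to headless mode for unattended runs without editing the rest of the script.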
The script includes a timeout option that determines how long Puppeteer waits for the product items to load. If the scraper does not find 100 items within the specified time, it stops and outputs the number of items it found. By default, the timeout is set to 30 seconds.
To modify the timeout, edit the timeout variable in script2.js.
Note that increasing the timeout can increase the time it takes for the script to complete, while decreasing the timeout can increase the risk of the scraper not finding all 100 items. The timeout value should be set based on the performance of the website being scraped and the speed of your internet connection.
The Amazon Basics store page loads more items as you scroll down the page, rather than requiring a click to go to the next page. This webpage format may depend on the viewport size, which we set to a consistent value using the following code:
await page.setViewport({ width: 1280, height: 720 });
By setting the viewport to a fixed width and height, we ensure that the page layout stays consistent across machines, so the same scroll-down scraping method works regardless of where the script runs.
To ensure that the script finds all 100 product items on the Amazon Basics store page, we use the following while loop. The loop scrolls down the page until 100 items have been loaded, or until the specified timeout has been reached.
while (itemsLoaded < 100 && Date.now() - start < timeout) {
  // scroll down by one viewport height to trigger loading of more items
  await page.evaluate(() => {
    window.scrollBy(0, window.innerHeight);
  });
  await page.waitForTimeout(1000); // wait 1 second for new items to load
  // count the product items currently present on the page
  itemsLoaded = await page.$$eval(".ProductGridItem__image__ih70n", (items) => items.length);
}
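The loop above can also be sketched as a reusable function, which makes the timeout and item target explicit parameters. The names scrollUntilLoaded, sleep, target, and pause are assumptions made for illustration and do not come from the project. The sketch also swaps page.waitForTimeout for a plain Promise-based sleep, since that method was deprecated in later Puppeteer versions.

```javascript
// Plain Promise-based pause, used here in place of page.waitForTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper around the scroll loop. `page` is a Puppeteer Page
// (or any object exposing evaluate() and $$eval()).
async function scrollUntilLoaded(page, options = {}) {
  const { selector, target = 100, timeout = 30000, pause = 1000 } = options;
  const start = Date.now();
  let itemsLoaded = 0;

  while (itemsLoaded < target && Date.now() - start < timeout) {
    // Scroll down by one viewport height to trigger lazy loading.
    await page.evaluate(() => {
      window.scrollBy(0, window.innerHeight);
    });
    await sleep(pause); // give new items time to load
    // Count the matching elements currently present in the page.
    itemsLoaded = await page.$$eval(selector, (items) => items.length);
  }

  // timedOut is true when the loop ended before reaching the target.
  return { itemsLoaded, timedOut: itemsLoaded < target };
}

// Usage, assuming an open Puppeteer page:
// const result = await scrollUntilLoaded(page, {
//   selector: '.ProductGridItem__image__ih70n',
// });
// console.log(`Found ${result.itemsLoaded} items`);
```

Returning both the count and a timedOut flag lets the caller report how many items were found when the timeout is reached, matching the behavior described for the script.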
Screen.Recording.2023-04-11.at.2.55.57.PM.mp4
Sample: