This is a web scraper that extracts restaurant data from grabfood.sg. Scraping is done with PlaywrightCrawler, and the proxy provider is Bright Data.
- It uses a JavaScript library called Crawlee. Crawlee handles JavaScript-rendered web pages with a headless browser (Playwright or Puppeteer) and uses Cheerio for scraping from HTML. Playwright is very similar to Selenium.
- Crawlee handles user-agent and browser-fingerprint rotation internally, and this can be customized further. It also has a logger and error handlers.
- This scraper is highly scalable and can be deployed as an individual actor on Apify.
- Bright Data is used as the proxy provider. Apify also provides a free proxy, but it has some limitations.
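
As a rough illustration of the proxy setup, the sketch below builds one proxy URL per session from the `PROXY_USERNAME`, `PROXY_PASSWORD`, and `PROXY_HOST` variables named later in this README. The helper `buildProxyUrls` is hypothetical, and two things are assumptions rather than facts from this project: that `PROXY_HOST` already includes the port, and that the provider selects a session through a `-session-<id>` suffix on the username. Check the provider's documentation for the exact format.

```javascript
const nodeCrypto = require('node:crypto');

// Hypothetical helper: builds `count` proxy URLs, each with its own session ID.
// ASSUMPTIONS: PROXY_HOST already includes the port ("host:port"), and the
// provider picks a session from a "-session-<id>" suffix on the username.
function buildProxyUrls(env, count) {
  const required = ['PROXY_USERNAME', 'PROXY_PASSWORD', 'PROXY_HOST'];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    // Fail early with a clear message instead of producing broken URLs.
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  const urls = [];
  for (let i = 0; i < count; i += 1) {
    // A random 16-character hex string keeps the sessions distinct.
    const sessionId = nodeCrypto.randomBytes(8).toString('hex');
    // Percent-encode the credentials so special characters stay valid in a URL.
    const username = encodeURIComponent(`${env.PROXY_USERNAME}-session-${sessionId}`);
    const password = encodeURIComponent(env.PROXY_PASSWORD);
    urls.push(`http://${username}:${password}@${env.PROXY_HOST}`);
  }
  return urls;
}
```

After the `.env` file has been loaded, a call such as `buildProxyUrls(process.env, 5)` would return five URLs that could then be handed to the crawler's proxy configuration.
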
- HTTP 403 error when accessing the website.
- No Restaurant notice was visible.
- Not able to get the delivery fee properly.
- Not able to get the restaurant's latitude and longitude.
- To get the latitude and longitude, Google Maps (SERP API) or DuckDuckGo map search can be used.
- Create the gzip file as soon as the data is ready, using the getData() method.
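
The gzip step can be sketched with Node's built-in `zlib` module. This is a minimal sketch, not the project's actual code: `gzipItems` and `verifyGzip` are hypothetical helpers, and the `items` array is a stand-in for whatever the dataset's getData() call returns. The checksum comparison is one possible way to confirm that nothing is lost or corrupted in a compress/decompress round trip.

```javascript
const zlib = require('node:zlib');
const { createHash } = require('node:crypto');

// SHA-256 hex digest of a buffer, used as a simple integrity fingerprint.
function sha256(buffer) {
  return createHash('sha256').update(buffer).digest('hex');
}

// Hypothetical helper: serializes the items to JSON and gzips the result.
// Returns the compressed bytes plus a checksum of the uncompressed JSON.
function gzipItems(items) {
  const raw = Buffer.from(JSON.stringify(items), 'utf8');
  return { compressed: zlib.gzipSync(raw), checksum: sha256(raw) };
}

// Hypothetical helper: decompresses and checks the data against the checksum.
function verifyGzip(compressed, checksum) {
  const restored = zlib.gunzipSync(compressed);
  if (sha256(restored) !== checksum) {
    throw new Error('Checksum mismatch: data was corrupted during compression');
  }
  return JSON.parse(restored.toString('utf8'));
}

// Stand-in for scraped records (assumption: the real items come from the dataset).
const items = [{ name: 'Example Restaurant', cuisine: 'Thai', rating: 4.5 }];
const { compressed, checksum } = gzipItems(items);
const restored = verifyGzip(compressed, checksum);
```

In a real run, `compressed` could be written to disk with `fs.writeFileSync`, and the verification step could be kept for debugging only.
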
- Environment Variables: Ensure that the `.env` file exists and contains `PROXY_USERNAME`, `PROXY_HOST`, and `PROXY_PASSWORD`.
- Proxy URLs Generation: Verify the logic for generating unique proxy session URLs and match the expected format for the proxy service.
- Error Handling: Implement try-catch blocks around potential points of failure, such as network requests and file operations.
- Resource Management: Verify that the crawler's `headless` mode is intentionally set to `false` for debugging or visual monitoring purposes, as this may consume more resources.
- Permissions: Confirm that the website allows for automated control and interaction, especially for geolocation permissions.
- Selector Accuracy: Verify that all selectors used (`[aria-label="Change delivery address"]`, `input#location-input`, etc.) are current and match the elements on the webpage.
- Error Handling: Add error handling for asynchronous operations, especially web scraping and network requests, to manage timeouts and missing elements gracefully.
- Data Integrity: Ensure the logic for extracting restaurant details (name, cuisine, rating, etc.) correctly handles null or undefined cases without causing the entire process to fail.
- Infinite Scroll Handling: Confirm that the `infiniteScroll` function efficiently loads all necessary data without causing excessive load times or hitting rate limits.
- Dataset Usage: Ensure data is being recorded and saved as expected.
- Data Integrity: After compressing and decompressing, verify that the data integrity is maintained, and no information is lost or corrupted.
- Efficiency: Compressing and then immediately decompressing the data within the same script serves only to demonstrate the data-corruption check.
- Code Organization: Ensure that your code is well-organized and logically structured for readability and maintainability.
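
The data-integrity note about null or undefined restaurant fields can be sketched as a small normalization step. The helpers below (`cleanText`, `parseRating`, `normalizeRestaurant`) are hypothetical and not taken from the project; they assume each field arrives as raw text (or is missing) and map anything unusable to `null` so that one bad record does not stop the whole run.

```javascript
// Hypothetical helper: returns trimmed text, or null for missing/empty values.
function cleanText(value) {
  if (typeof value !== 'string') return null;
  const trimmed = value.trim();
  return trimmed === '' ? null : trimmed;
}

// Hypothetical helper: parses a rating such as "4.5" and returns null
// when the value is missing or not numeric.
function parseRating(value) {
  const text = cleanText(typeof value === 'number' ? String(value) : value);
  if (text === null) return null;
  const rating = Number.parseFloat(text);
  return Number.isFinite(rating) ? rating : null;
}

// Hypothetical helper: builds a record with a fixed shape, even when the
// scraped input is null, undefined, or only partially filled in.
function normalizeRestaurant(raw) {
  const source = raw ?? {};
  return {
    name: cleanText(source.name),
    cuisine: cleanText(source.cuisine),
    rating: parseRating(source.rating),
  };
}
```

Every record then has the same keys, which keeps later steps such as saving to the dataset or compressing the output free of special cases for missing fields.
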