GrabFood - Anakin (Assignment)

This is a web scraper that extracts restaurant data from grabfood.sg. The scraping is done using PlaywrightCrawler, and the proxy provider is Bright Data.

Approach and methodology

  • The scraper uses Crawlee, a JavaScript library that handles JavaScript-rendered web pages using a headless browser (Playwright or Puppeteer) and can also scrape static HTML with Cheerio. Playwright is very similar to Selenium.
  • Crawlee handles user-agent and browser-fingerprint rotation internally, and this behaviour can be customized further. It also provides a logger and error handlers.
  • The scraper is highly scalable and can be deployed as an individual Actor on Apify.
  • Bright Data is used as the proxy provider. Apify also provides a free proxy, but it has some limitations. A minimal setup sketch follows this list.
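
A minimal sketch of this setup, using Crawlee's PlaywrightCrawler, ProxyConfiguration, and Dataset APIs. The environment-variable names follow the checklist below; the start URL and the h1 selector are illustrative assumptions, not the scraper's actual selectors:

```javascript
import { PlaywrightCrawler, ProxyConfiguration, Dataset } from 'crawlee';
import 'dotenv/config';

// Route all requests through Bright Data using credentials from .env.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        `http://${process.env.PROXY_USERNAME}:${process.env.PROXY_PASSWORD}@${process.env.PROXY_HOST}`,
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    headless: true, // set to false for visual debugging (see checklist below)
    requestHandler: async ({ page, request, log }) => {
        log.info(`Scraping ${request.url}`);
        // Illustrative selector only; the real selectors live in the scraper.
        const name = await page.locator('h1').first().textContent();
        await Dataset.pushData({ url: request.url, name });
    },
});

await crawler.run(['https://food.grab.com/sg/en/']);
```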

Challenges

  • HTTP 403 errors when accessing the website.
  • No restaurant notice was visible.
  • Unable to extract the delivery fee reliably.
  • Unable to extract the restaurant's latitude and longitude.

Improvements or Optimizations

  • To get the latitude and longitude, Google Maps (via a SERP API) or DuckDuckGo map search could be used.
  • Gzip the data as soon as it is ready using the dataset's getData() method, rather than compressing everything at the end; see the sketch below.
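
A sketch of the second improvement, assuming Crawlee's Dataset.getData() API and Node's built-in zlib; the output file name is an assumption:

```javascript
import { Dataset } from 'crawlee';
import { gzip } from 'node:zlib';
import { promisify } from 'node:util';
import { writeFile } from 'node:fs/promises';

const gzipAsync = promisify(gzip);

// Compress whatever the crawler has collected so far, without waiting
// for the whole run to finish.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
const compressed = await gzipAsync(JSON.stringify(items));
await writeFile('restaurants.json.gz', compressed);
```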

Quality Control Checklist

  • Environment Variables: Ensure that the .env file exists and contains PROXY_USERNAME, PROXY_HOST, and PROXY_PASSWORD.
  • Proxy URL Generation: Verify that the logic for generating unique proxy session URLs matches the format expected by the proxy service; see the first sketch after this checklist.
  • Error Handling: Implement try-catch blocks around potential points of failure, such as network requests and file operations.
  • Resource Management: Verify that the crawler's headless mode is intentionally set to false for debugging or visual monitoring purposes, as this may consume more resources.
  • Permissions: Confirm that the browser is granted the permissions needed for automated control and interaction, especially geolocation.
  • Selector Accuracy: Verify that all selectors used ([aria-label="Change delivery address"], input#location-input, etc.) are current and match the elements on the webpage.
  • Error Handling: Add error handling for asynchronous operations, especially web scraping and network requests, to manage timeouts and missing elements gracefully.
  • Data Integrity: Ensure the logic for extracting restaurant details (name, cuisine, rating, etc.) correctly handles null or undefined cases without causing the entire process to fail; see the second sketch after this checklist.
  • Infinite Scroll Handling: Confirm that the infiniteScroll function efficiently loads all necessary data without causing excessive load times or hitting rate limits; see the third sketch after this checklist.
  • Dataset Usage: Ensure data is being recorded and saved as expected.
  • Data Integrity: After compressing and decompressing, verify that data integrity is maintained and no information is lost or corrupted.
  • Efficiency: Compressing and then immediately decompressing the data within the same script is only useful as a demonstration that nothing is corrupted (see the last sketch after this checklist); avoid the round trip in production.
  • Code Organization: Ensure that your code is well-organized and logically structured for readability and maintainability.
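
Sketches for the checklist items above follow. First, proxy session URL generation, assuming the common Bright Data convention of appending a -session-<id> suffix to the username; verify the exact format against your proxy zone's documentation:

```javascript
import 'dotenv/config';

// Each distinct session id pins requests to a different proxy session,
// which Crawlee can rotate across requests.
function generateProxyUrls(count) {
    const { PROXY_USERNAME, PROXY_PASSWORD, PROXY_HOST } = process.env;
    return Array.from({ length: count }, () => {
        const sessionId = Math.random().toString(36).slice(2, 10);
        return `http://${PROXY_USERNAME}-session-${sessionId}:${PROXY_PASSWORD}@${PROXY_HOST}`;
    });
}

console.log(generateProxyUrls(5));
```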
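Next, null-safe extraction of restaurant details, so one missing field does not discard the whole record; the data-testid selectors are illustrative assumptions, not the scraper's real ones:

```javascript
// Returns the trimmed text of the first match, or null if the element
// is missing or times out, instead of throwing and killing the record.
async function safeText(page, selector) {
    try {
        const text = await page.locator(selector).first().textContent({ timeout: 5000 });
        return text?.trim() ?? null;
    } catch {
        return null;
    }
}

async function extractRestaurant(page) {
    return {
        name: await safeText(page, 'h1'),
        cuisine: await safeText(page, '[data-testid="cuisine"]'),
        rating: await safeText(page, '[data-testid="rating"]'),
        deliveryFee: await safeText(page, '[data-testid="delivery-fee"]'),
    };
}
```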
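Next, bounded infinite scrolling with Crawlee's playwrightUtils.infiniteScroll; the timeout values are assumptions to tune against the live page:

```javascript
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        await playwrightUtils.infiniteScroll(page, {
            timeoutSecs: 60, // hard cap on total scrolling time
            waitForSecs: 4,  // stop early if no new content arrives for this long
        });
        // ...extract the fully loaded restaurant list here
    },
});

await crawler.run(['https://food.grab.com/sg/en/restaurants']);
```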
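Finally, a gzip round-trip integrity check using Node's built-in zlib; the sample record is illustrative:

```javascript
import { gzipSync, gunzipSync } from 'node:zlib';
import assert from 'node:assert';

const original = JSON.stringify([{ name: 'Example Restaurant', rating: 4.5 }]);
const compressed = gzipSync(original);
const restored = gunzipSync(compressed).toString('utf8');

// Fails loudly if any byte was lost or corrupted in the round trip.
assert.strictEqual(restored, original, 'data corrupted during gzip round-trip');
console.log('round-trip OK, no data lost');
```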