Worker of Distributed Web Crawler and Data Management System for Web Data
Tech Stack
What we provide
- Create workflows for crawling.
- Crawl and parse product data in a distributed environment (Master / Worker model).
- Upload / Update crawled data in the database incrementally (View maintenance in Database; see the upsert sketch after this list).
- Upload / Update crawled data to target sites (View maintenance in target sites).
- Register schedules for crawling and view maintenance.
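Incremental upload / update amounts to an upsert per crawled record: insert rows that are new, update rows that already exist. Below is a minimal sketch assuming a PostgreSQL table `product` keyed by `(site, product_id)`; the table, columns, and connection string are illustrative, not the project's actual schema.

```python
# Minimal sketch of the incremental upload / update step.
# Assumptions: a PostgreSQL table "product" keyed by (site, product_id);
# these names are illustrative, not the project's real schema.
import psycopg2

def upsert_products(rows, dsn="postgresql://localhost/pse"):
    """Insert newly crawled rows, or update the existing ones in place."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for row in rows:
            cur.execute(
                """
                INSERT INTO product (site, product_id, title, price, crawled_at)
                VALUES (%(site)s, %(product_id)s, %(title)s, %(price)s, now())
                ON CONFLICT (site, product_id)
                DO UPDATE SET title = EXCLUDED.title,
                              price = EXCLUDED.price,
                              crawled_at = EXCLUDED.crawled_at
                """,
                row,
            )
```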
How we support it
- Provide all services through a GUI.
- git repository link: https://github.com/SML0127/pse-extension
- Easily create workflows for crawling (no code or scripts).
- For crawling in a distributed environment, we chose a Breadth-First-Search crawling model with Redis & RQ as the message broker (see the sketch after this list).
- For the Breadth-First-Search crawling model, we created several operators for crawling.
- Docker images for our Ubuntu environment
- git repository link for Master: https://github.com/SML0127/pse-master-Dockerfile
- git repository link for Worker: https://github.com/SML0127/pse-worker-Dockerfile
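The sketch below shows how BFS-style crawling maps onto Redis & RQ: the Master seeds the queue with a root URL, and each Worker job parses one page and enqueues its child links one level deeper, so the FIFO queue yields breadth-first order. The function name, queue name, and URL are illustrative, and fetching is done with requests here only to keep the sketch short; the actual workers use Selenium.

```python
# Sketch of BFS-style crawling over Redis & RQ (names and URLs are illustrative).
import requests
from bs4 import BeautifulSoup
from redis import Redis
from rq import Queue

queue = Queue("crawl", connection=Redis())   # FIFO queue => breadth-first order

def crawl_page(url, depth, max_depth=2):
    """Worker job: fetch one page, parse it, enqueue child links one level deeper."""
    html = requests.get(url, timeout=10).text        # the project fetches with Selenium
    soup = BeautifulSoup(html, "html.parser")
    # ... run the parsing operators on `soup` and store product data ...
    if depth < max_depth:
        for a in soup.select("a[href]"):
            queue.enqueue(crawl_page, a["href"], depth + 1)

# Master side: seed the root URL; RQ workers on other machines consume the queue.
queue.enqueue(crawl_page, "https://example.com/products", 0)
```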
What environment, languages, libraries, and tools were used?
- Master / Worker run on Ubuntu 20.04.
- Master / Worker are written mainly in Python.
- React & JSX for the GUI.
- Python Flask for the Web Application Server & DB Server
- PostgreSQL for the Database
- Apache Airflow for Scheduling
- Redis & RQ as the Message Broker in the distributed environment
- Selenium & Chromedriver & XPath for Crawling (see the sketch after this list)
- Docker images for the Master / Worker environment
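To show how the crawling pieces fit together, here is a minimal sketch of driving headless Chrome through Selenium 4 and extracting fields with XPath; the URL and XPath expressions are placeholders, not the project's actual operator parameters.

```python
# Sketch of XPath-based extraction with Selenium & Chromedriver (Selenium 4 API).
# The URL and XPath expressions are placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")      # no display needed, e.g. inside the Docker image
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/products")
    # XPath expressions like these are what the GUI captures as operator parameters.
    titles = driver.find_elements(By.XPATH, "//div[@class='product']//h2")
    prices = driver.find_elements(By.XPATH, "//div[@class='product']//span[@class='price']")
    for title, price in zip(titles, prices):
        print(title.text, price.text)
finally:
    driver.quit()
```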
Overall Architecture
Overall Architecture with Implementation
Screenshots of GUI
- Create a workflow for crawling.
- Get XPath for parameters of operators in the workflow.
- Save and load workflows.
- Crawled data and error messages.
- History of Crawling and Upload / Update.
Demo videos
- Crawling
demo_crawling.mp4
- Upload / Update crawled data (View maintenance in Database).
demo_mysite.mp4
- Upload / Update crawled data to target sites (View maintenance in target sites).