pse-worker

Worker of Distributed Web Crawler and Data Management System for Web Data

Primary Language: Python

Tech Stack

Ubuntu, React, Python, Flask, PostgreSQL, Apache Airflow, Redis, Selenium, Docker, Node.js, npm



What we provide

  • Create workflows for crawling.
  • Crawl and parse product data in a distributed environment (Master / Worker model); see the sketch after this list.
  • Upload / Update crawled data in the database incrementally (view maintenance in the database).
  • Upload / Update crawled data to target sites (view maintenance in the target sites).
  • Register schedules for crawling and view maintenance.
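
A minimal sketch of the Master / Worker job distribution with Redis & RQ (the queue name `crawl` and the task path `worker_tasks.crawl_product_page` are illustrative assumptions, not this project's actual API):

```python
# master.py -- sketch: the Master enqueues one crawl job per URL into Redis.
# Queue name "crawl" and task path "worker_tasks.crawl_product_page" are
# illustrative assumptions, not this project's actual API.
from redis import Redis
from rq import Queue

redis_conn = Redis(host="localhost", port=6379)
queue = Queue("crawl", connection=redis_conn)

for url in ["https://example.com/product/1", "https://example.com/product/2"]:
    # RQ accepts a dotted import path, so the Master does not need the
    # Worker's task code installed locally.
    queue.enqueue("worker_tasks.crawl_product_page", url)
```

Any idle Worker then consumes the queue with `rq worker crawl`, executing jobs as they arrive.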


How to support



What environment, languages, libraries, and tools were used?

  • Master / Worker run on Ubuntu 20.04.
  • Mainly based on Python for the Master / Worker.
  • React & JSX for the GUI.
  • Python Flask for the web application server & DB server.
  • PostgreSQL for the database.
  • Apache Airflow for scheduling.
  • Redis & RQ as the message broker in the distributed environment.
  • Selenium & ChromeDriver & XPath for crawling; see the sketch after this list.
  • Docker images for the Master / Worker environments.
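
A minimal crawling sketch with Selenium, ChromeDriver, and XPath (the URL and XPath expressions are placeholders, not this project's selectors):

```python
# sketch: drive headless Chrome and parse product fields with XPath.
# The URL and XPath expressions are placeholders, not project selectors.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # no display needed, e.g. inside Docker
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/product/1")
    name = driver.find_element(By.XPATH, "//h1[@class='product-name']").text
    price = driver.find_element(By.XPATH, "//span[@class='price']").text
    print({"name": name, "price": price})
finally:
    driver.quit()  # always release the browser, even on parse errors
```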

Overall Architecture


[diagram: overall architecture]

Overall Architecture with Implementation


[diagram: overall architecture with implementation]



Screenshots of GUI

  • Create a workflow for crawling.

  • Get XPath for parameters of operators in the workflow.

  • Save and load workflows.

  • Crawled data and error messages.

  • History of Crawling and Upload / Update.



Demo videos

  • Crawling
demo_crawling.mp4
  • Upload / Update crawled data (view maintenance in the database); a sketch of the underlying upsert follows this list.
demo_mysite.mp4
  • Upload / Update crawled data to target sites (view maintenance in the target sites).
demo_targetsite.mp4
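
The incremental Upload / Update shown in the demos maps naturally onto a PostgreSQL upsert. A minimal sketch, assuming a hypothetical `products` table keyed by URL (the DSN, table, and column names are not this project's actual schema):

```python
# sketch: incremental view maintenance as a PostgreSQL upsert.
# DSN, table, and column names are assumptions, not the project's schema.
import psycopg2

conn = psycopg2.connect("dbname=pse user=pse")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO products (url, name, price, crawled_at)
        VALUES (%s, %s, %s, NOW())
        ON CONFLICT (url) DO UPDATE
        SET name = EXCLUDED.name,
            price = EXCLUDED.price,
            crawled_at = EXCLUDED.crawled_at
        """,
        ("https://example.com/product/1", "Example Product", "19.99"),
    )
conn.close()
```

New rows are inserted and existing rows are updated in place, which keeps the database view consistent with the latest crawl.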