/std-web-crawler

An over-engineered crawler framework for extracting articles from the Austrian newspaper DerStandard

Primary language: Python

DerStandard Web Crawler

This repository contains a web crawler for the Austrian newspaper DerStandard. The crawler is written in Python and uses an over-engineered architecture to crawl the content of all articles while remaining polite to the site. The rate limit is synchronized across multiple workers to achieve the highest possible number of requests per minute (rpm) while still giving the workers time to parse the content of each page. This makes it possible, for example, to use Selenium, which has to spin up a full browser to render the page and simulate user interactions.
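The idea of a rate limit shared by several workers can be illustrated with a minimal sketch. This is not the repository's actual implementation: the names `SharedRateLimiter` and `run_workers` are hypothetical, and the sketch assumes the workers are threads in one process (a multi-process setup would need a shared store instead of a `threading.Lock`). Each worker reserves the next free request slot under a lock and then waits for it outside the lock, so slow parsing in one worker does not hold up the others.

```python
import threading
import time


class SharedRateLimiter:
    """Hypothetical limiter: hands out request slots 60 / rpm seconds apart."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def reserve(self):
        """Reserve the next free slot and return the seconds left until it."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
            return slot - now

    def acquire(self):
        """Block the calling worker until its reserved slot arrives."""
        delay = self.reserve()
        if delay > 0:
            self._sleep(delay)


def run_workers(limiter, urls, n_workers=3):
    """Split `urls` across threads; each waits for a slot before 'fetching'."""
    fetched = []
    fetched_lock = threading.Lock()

    def work(chunk):
        for url in chunk:
            limiter.acquire()
            # A real worker would fetch and parse `url` here. The limiter's
            # lock is already released, so parsing does not block other workers.
            with fetched_lock:
                fetched.append(url)

    threads = [
        threading.Thread(target=work, args=(urls[i::n_workers],))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched
```

For example, `run_workers(SharedRateLimiter(rpm=30), urls)` would start at most one request every two seconds in total, regardless of how many workers are running. The `clock` and `sleep` parameters exist only so the sketch can be exercised without real waiting.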