twds-crawler

This repository contains the code to build a highly scalable web crawler for towardsdatascience.com using Python, Selenium, Docker, Kubernetes, and the infrastructure of the Google Cloud Platform. It was part of a data science class, with the goal of getting hands-on experience with some of the most common technologies for large-scale web and big data processing.

Documentation

A more detailed description of the implementation can be found in my medium.com article.

Troubleshooting

Additionally, I documented some of the challenges I ran into in trouble-shooting.md.
