Fullstack Coding Challenge Web Scraper

Primary language: Python. License: MIT.

Coding Challenge Web Scraper

A simple web application for analyzing websites.

Setup

You can get the web app up and running using docker-compose:

$ docker-compose up --build

Backend

The backend consists of the scraper package, which contains the scraping library, and a minimal Flask application that exposes the scraper through a web API and caches its results in memcached.
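The caching step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `InMemoryCache`, `cache_key`, and `cached_scrape` are hypothetical names, and the in-memory class only stands in for a memcached client so that the example runs without a server. The 24-hour lifetime comes from the development log; hashing the URL is an assumption made here because memcached keys are limited to 250 bytes and may not contain whitespace.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as stated in the development log


class InMemoryCache:
    """Hypothetical stand-in for a memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # entry has expired, so drop it
            return None
        return value

    def set(self, key, value, expire):
        self._store[key] = (value, time.monotonic() + expire)


def cache_key(url):
    # Hash the URL so the key is short and free of whitespace.
    return "scrape:" + hashlib.sha256(url.encode("utf-8")).hexdigest()


def cached_scrape(url, scrape, cache):
    """Return the cached result for url, calling scrape(url) only on a miss."""
    key = cache_key(url)
    result = cache.get(key)
    if result is None:
        result = scrape(url)
        cache.set(key, result, expire=CACHE_TTL_SECONDS)
    return result
```

In a Flask resource, a helper like `cached_scrape` would be called with the real scraper function and a real memcached client in place of the stand-in.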

Testing

After building the Docker image, install the development packages with:

$ pipenv install --dev

Then run the tests:

$ pipenv run python -m pytest

Tech stack

  • Python 3.6
  • Flask 1.0
  • Flask-Restful
  • BeautifulSoup4
  • Pipenv
  • Envdir
  • Gunicorn
  • Memcached

Frontend

The frontend is a minimal Angular 5 app that renders the input form and handles the API calls.

Tech stack

  • Angular 5
  • Angular Material
  • Tachyons
  • Nginx 1.13

Infrastructure

  • Docker 18.04
  • Docker Compose

Development log

  • I started with developing the web scraper in a separate package, and adding unit tests.
  • The next step was to develop the Flask app that would wrap the scraper package and expose a web API.
  • The backend development concluded with setting up memcached and using it to cache the results of the scraper for 24 hours.
  • Then I worked on developing the Angular 5 frontend, which required a lot of documentation reading.
  • I polished the UI, adding a spinner and styling with the Tachyons library.
  • The last step was to write this document.

Implementation decisions

  • The scope of the web app was deemed too small to justify adding any authentication mechanism.
  • Likewise, the backend API does not generate any automated documentation.
  • The backend has extensive test coverage.
  • The html_version returned by the scraper is just the document's doctype. It could be improved with a nicer mapping to version numbers.
  • The login_form detection is based only on the existence of an input field of type password.
  • The asynchronous fetching of inaccessible URLs could be tuned and improved further.
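The html_version and login_form decisions above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: it uses the standard library's `html.parser` instead of BeautifulSoup4 so that it runs without extra packages, and `PageInspector`, `DOCTYPE_VERSIONS`, `html_version`, and `inspect_page` are hypothetical names. The doctype-to-version table shows one possible form of the "nicer mapping" and is not exhaustive.

```python
from html.parser import HTMLParser

# Illustrative (marker, label) pairs; a hypothetical, incomplete mapping.
DOCTYPE_VERSIONS = [
    ("xhtml 1.1", "XHTML 1.1"),
    ("xhtml 1.0", "XHTML 1.0"),
    ("html 4.01", "HTML 4.01"),
    ("html 3.2", "HTML 3.2"),
]


class PageInspector(HTMLParser):
    """Collects the first doctype and notes any password input field."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.has_password_input = False

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>.
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. <input type>), hence the "or".
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.has_password_input = True


def html_version(doctype):
    """Map a raw doctype declaration to a readable version label."""
    if doctype is None:
        return "unknown"
    normalized = " ".join(doctype.lower().split())
    if normalized == "doctype html":
        return "HTML5"
    for marker, label in DOCTYPE_VERSIONS:
        if marker in normalized:
            return label
    return "unknown"


def inspect_page(html):
    parser = PageInspector()
    parser.feed(html)
    parser.close()
    return {
        "html_version": html_version(parser.doctype),
        "login_form": parser.has_password_input,
    }
```

As the list notes, a password field is only a heuristic: a page with a password-protected settings form would also be reported as having a login form.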