
Data Engineering Challenge

A solution written in Python for a data engineering challenge test.

To Run the Code

Framework and Dependencies

Install the dependencies with pip:

pip3 install scrapy readability-lxml pymongo html2text cherrypy

Steps to run:

1. Crawl the articles:

$ ./crawl.sh
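
crawl.sh is assumed to wrap a scrapy crawl invocation that writes out crawled_data.csv. As a rough illustration only, a Guardian spider could look like the sketch below; the spider name, start URL, selectors, and output fields are assumptions, not the repository's actual code.

```python
# Hypothetical spider sketch; the real spider lives in the repository.
import scrapy

class GuardianSpider(scrapy.Spider):
    name = "guardian"                                # assumed spider name
    start_urls = ["https://www.theguardian.com/au"]  # assumed section page

    def parse(self, response):
        # Follow links from the section page; a real spider would filter
        # these down to article URLs only.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse_article)

    def parse_article(self, response):
        # Keep the raw HTML so it can be cleansed later, in step 3.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "html": response.text,
        }
```

Run with an output flag (e.g. scrapy runspider spider.py -o crawled_data.csv), this would produce a CSV in the shape step 3 expects.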

2. Set up the MongoDB database

  • refer to setup_mongo.commands.txt in the codebase

NB: for steps 3 and 4, the environment variable 'ISENTIA_COMPOSE_MONGO_CONNECTION' needs to be set to a MongoDB connection string.
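
A minimal sketch of how a script can pick up this variable with pymongo (the database name comes from the connection string itself):

```python
import os
from pymongo import MongoClient

# e.g. export ISENTIA_COMPOSE_MONGO_CONNECTION="mongodb://user:pass@host:27017/dbname"
connection_string = os.environ["ISENTIA_COMPOSE_MONGO_CONNECTION"]
client = MongoClient(connection_string)
db = client.get_default_database()  # the database named in the connection string
```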

3. Cleanse the data and load it into MongoDB

$ python3 cleanse_and_load.py <crawled_data.csv>
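
The cleansing combines readability-lxml and html2text, roughly as below; the function and the stored field names are illustrative, not the script's exact code.

```python
from readability import Document
import html2text

def cleanse(raw_html):
    # Readability extracts the main article content from the full page.
    doc = Document(raw_html)
    cleansed_html = doc.summary()    # article body as simplified HTML
    title = doc.short_title()

    # html2text converts the simplified HTML into plain text.
    converter = html2text.HTML2Text()
    converter.ignore_links = True
    text = converter.handle(cleansed_html)

    return {"title": title, "html": cleansed_html, "text": text}
```

Each cleansed record would then be inserted into a collection, e.g. db.articles.insert_one(cleanse(row["html"])) (collection name assumed).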

4. Run the API server

$ python3 api_server.py
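
api_server.py serves the search endpoint with CherryPy. The sketch below shows the general shape such a server can take; the endpoint path, collection name, and defaults are assumptions based on the parameter list that follows.

```python
import json
import os

import cherrypy
from pymongo import MongoClient

db = MongoClient(os.environ["ISENTIA_COMPOSE_MONGO_CONNECTION"]).get_default_database()

class SearchAPI:
    @cherrypy.expose
    def search(self, keywords="", article_format="text", limit="10"):
        # Cap the result count at 100, per the parameter documentation below.
        n = min(int(limit), 100)
        cursor = db.articles.find(
            {"$text": {"$search": keywords}},        # MongoDB full-text search
            {"_id": 0, "title": 1, article_format: 1},
        ).limit(n)
        cherrypy.response.headers["Content-Type"] = "application/json"
        return json.dumps(list(cursor)).encode("utf-8")

if __name__ == "__main__":
    cherrypy.quickstart(SearchAPI())  # listens on port 8080 by default
```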

API Parameters

  • keywords: the keywords to search for; the syntax is the same as the $search field of a MongoDB $text query.
  • article_format: the format of the returned articles, either 'html' or 'text'; defaults to 'text'.
  • limit: the maximum number of articles to return; must not exceed 100.
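
An example request against the server sketched above, using the requests library (the host, port, and endpoint path are assumptions):

```python
import requests

resp = requests.get(
    "http://localhost:8080/search",  # assumed endpoint path and default port
    params={"keywords": "election", "article_format": "text", "limit": 10},
)
print(resp.json())
```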

Approach

  • To use Scrapy for some initial investigation, to decide which sections and which attributes to crawl
  • To investigate what to use to cleanse the text (Readability was suggested)
  • To investigate keyword search with stemming and case-insensitive matching (ElasticSearch could be an option)

Design Decisions

  • The data processing was split into two separate steps:
    1. Crawling with Scrapy, saving the results to a CSV file
    2. Cleansing the data with Readability and html2text, and loading it into Compose.io
  • Even though Scrapy has a MongoDB pipeline, it was deliberately not used, so that the raw HTML is scraped to and kept on local disk only.
  • Only the cleansed HTML and the text summary (produced with Readability and html2text) were saved in MongoDB, not the raw HTML.
  • Since MongoDB 3 supports full-text search with stemming and case-insensitive matching, ElasticSearch was unnecessary (see the pymongo sketch after this list).
  • Decided to host the API on my existing DigitalOcean server instead of Amazon EC2.
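
For reference, the MongoDB behaviour this decision relies on looks like the following in pymongo (collection and field names are illustrative):

```python
import os
from pymongo import MongoClient

db = MongoClient(os.environ["ISENTIA_COMPOSE_MONGO_CONNECTION"]).get_default_database()

# A text index gives stemmed, case-insensitive matching out of the box.
db.articles.create_index([("title", "text"), ("text", "text")])

# $search accepts the same keyword syntax as the API's 'keywords' parameter.
for article in db.articles.find({"$text": {"$search": "election senate"}}).limit(5):
    print(article.get("title"))
```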

Log

  • 27/Aug: crawled 2,335 articles from the Guardian, using my DigitalOcean server.
  • 30/Aug: cleansed the text with Readability and loaded the cleansed text and other metadata into Compose.io.
  • 30/Aug: search API implemented and deployed.