A solution written in Python for a data engineering challenge test.
- Python 3.x
- Scrapy for crawling
- python-readability for producing HTML summaries
- PyMongo for connecting to MongoDB
- html2text for converting HTML to plain text for full-text search in Mongo
- CherryPy for serving an API
pip3 install scrapy readability-lxml pymongo html2text cherrypy
1. Crawling:
$ ./crawl.sh
2. Set up the Mongo database
- refer to setup_mongo.commands.txt in the codebase
NB. For steps 3 and 4, the environment variable 'ISENTIA_COMPOSE_MONGO_CONNECTION'
needs to be defined as a MongoDB connection string.
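The scripts in steps 3 and 4 can read that variable via `os.environ`; a minimal sketch (the fail-fast error message is my own addition, not from the codebase):

```python
import os

def mongo_connection_string():
    """Return the MongoDB connection string from the environment.

    Failing fast here gives a clearer error than a connection
    timeout deep inside PyMongo later on.
    """
    try:
        return os.environ["ISENTIA_COMPOSE_MONGO_CONNECTION"]
    except KeyError:
        raise SystemExit(
            "ISENTIA_COMPOSE_MONGO_CONNECTION must be set to a "
            "MongoDB connection string, e.g. mongodb://user:pass@host/db"
        )
```

cleanse_and_load.py and api_server.py would then pass the result straight to `pymongo.MongoClient`.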
3. Cleanse the data and load it into MongoDB
$ python3 cleanse_and_load.py <crawled_data.csv>
4. Run the API
$ python3 api_server.py
- keywords: a list of keywords to search for; the syntax is the same as the $search field in a Mongo $text query.
- article_format: 'html' or 'text'; defaults to 'text'.
- limit: the number of articles to return; must not be bigger than 100.
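The three parameters map onto a Mongo $text query roughly as follows; a sketch, assuming the collection stores the cleansed article under fields named `article_html` and `article_text` (the field names are my assumption, not from the codebase):

```python
def build_text_query(keywords, article_format="text", limit=100):
    """Translate API parameters into (filter, projection, limit) for find().

    - keywords feeds the $search field of a $text query unchanged.
    - article_format picks which stored field to return.
    - limit is capped at 100, matching the API contract above.
    """
    field = "article_html" if article_format == "html" else "article_text"
    filter_doc = {"$text": {"$search": keywords}}
    projection = {field: 1, "_id": 0}
    return filter_doc, projection, min(int(limit), 100)
```

The returned triple would be used as `collection.find(filter_doc, projection).limit(n)` in the API handler.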
- To use Scrapy for some initial investigation to decide which sections and which attributes to crawl
- To investigate what to use to cleanse the text (Readability was suggested)
- To investigate keyword search with stemming and casing (ElasticSearch could be an option)
- The data processing was split into two separate steps:
- Crawl with Scrapy and save the results into a CSV file
- Cleanse the data with Readability and html2text, and load it into Compose.io
- Even though Scrapy has a Mongo pipeline, it was not used, so that the raw HTML could be scraped once and kept on disk only.
- Only the HTML and text summaries (produced with Readability and html2text) were saved in MongoDB, not the raw HTML.
- Since Mongo 3 supports full-text search with stemming and case folding, ElasticSearch was unnecessary.
- Decided to host the API on my existing DigitalOcean server, instead of Amazon EC2.
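The hand-off between the two steps is a CSV of raw pages. A sketch of how the cleansing step might walk it — the column names `url` and `raw_html` are my assumption and should match whatever fields the spider exports; the Readability/html2text imports are deferred so the CSV reader works without those packages installed:

```python
import csv

def iter_crawled(csv_path):
    """Yield one dict per article from the CSV written by the Scrapy step."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield {"url": row["url"], "raw_html": row["raw_html"]}

def cleanse(raw_html):
    """Return (summary_html, summary_text) for one article.

    Uses readability-lxml for the HTML summary and html2text for the
    plain-text version that Mongo's $text index searches.
    """
    from readability import Document  # readability-lxml
    import html2text
    summary_html = Document(raw_html).summary()
    return summary_html, html2text.html2text(summary_html)
```

Each (summary_html, summary_text) pair, plus the URL and any other metadata, would then be inserted into the Mongo collection.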
- 27/Aug crawled 2,335 Guardian articles from my DigitalOcean server.
- 30/Aug cleansed the text with Readability and loaded the cleansed text and other metadata into Compose.io.
- 30/Aug Search API implemented and deployed.