A solution written in Python for a data engineering challenge test.
- Python 3.x
- Scrapy for crawling
- python-readability for producing HTML summaries
- PyMongo for connecting to MongoDB
- html2text for converting HTML to plain text for full-text search in Mongo
- CherryPy for serving an API
pip3 install scrapy readability-lxml pymongo html2text cherrypy
1. Crawling:
$ ./crawl.sh
2. Set up the Mongo database
- refer to setup_mongo.commands.txt in the codebase
NB. For steps 3 and 4, the environment variable 'ISENTIA_COMPOSE_MONGO_CONNECTION'
needs to be defined as a MongoDB connection string.
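The scripts in steps 3 and 4 can read that variable via `os.environ`; a minimal sketch (the fail-fast error message is my own addition, not from the codebase):

```python
import os

def mongo_connection_string():
    """Return the MongoDB connection string from the environment.

    Failing fast here gives a clearer error than a connection
    timeout deep inside PyMongo later on.
    """
    try:
        return os.environ["ISENTIA_COMPOSE_MONGO_CONNECTION"]
    except KeyError:
        raise SystemExit(
            "ISENTIA_COMPOSE_MONGO_CONNECTION must be set to a "
            "MongoDB connection string, e.g. mongodb://user:pass@host/db"
        )
```

cleanse_and_load.py and api_server.py would then pass the result straight to `pymongo.MongoClient`.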
3. Cleanse the data and load it into MongoDB
$ python3 cleanse_and_load.py <crawled_data.csv>
4. Run the API
$ python3 api_server.py
- keywords: a list of keywords to search for; the syntax is the same as the $search field in a Mongo $text query.
- article_format: 'html' or 'text'; defaults to 'text'.
- limit: the number of articles to return; must not be bigger than 100.
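The three parameters map onto a Mongo $text query roughly as follows; a sketch, assuming the collection stores the cleansed article under fields named `article_html` and `article_text` (the field names are my assumption, not from the codebase):

```python
def build_text_query(keywords, article_format="text", limit=100):
    """Translate API parameters into (filter, projection, limit) for find().

    - keywords feeds the $search field of a $text query unchanged.
    - article_format picks which stored field to return.
    - limit is capped at 100, matching the API contract above.
    """
    field = "article_html" if article_format == "html" else "article_text"
    filter_doc = {"$text": {"$search": keywords}}
    projection = {field: 1, "_id": 0}
    return filter_doc, projection, min(int(limit), 100)
```

The returned triple would be used as `collection.find(filter_doc, projection).limit(n)` in the API handler.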
- To use Scrapy for some initial investigation to decide which sections and which attributes to crawl
- To investigate what to use to cleanse the text (Readability was suggested)
- To investigate keyword search with stemming and casing (ElasticSearch could be an option)
- The data processing was split into two separate steps:
- Crawl with Scrapy and save the results into a CSV file
- Cleanse the data with Readability and html2text, and load it into Compose.io
- Even though Scrapy has a Mongo pipeline, it was not used, so that the raw HTML could be scraped once and kept on disk only.
- Only the HTML and text summaries (produced with Readability and html2text) were saved in MongoDB, not the raw HTML.
- Since Mongo 3 supports full-text search with stemming and case folding, ElasticSearch was unnecessary.
- Decided to host the API on my existing DigitalOcean server, instead of Amazon EC2.
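The hand-off between the two steps is a CSV of raw pages. A sketch of how the cleansing step might walk it — the column names `url` and `raw_html` are my assumption and should match whatever fields the spider exports; the Readability/html2text imports are deferred so the CSV reader works without those packages installed:

```python
import csv

def iter_crawled(csv_path):
    """Yield one dict per article from the CSV written by the Scrapy step."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            yield {"url": row["url"], "raw_html": row["raw_html"]}

def cleanse(raw_html):
    """Return (summary_html, summary_text) for one article.

    Uses readability-lxml for the HTML summary and html2text for the
    plain-text version that Mongo's $text index searches.
    """
    from readability import Document  # readability-lxml
    import html2text
    summary_html = Document(raw_html).summary()
    return summary_html, html2text.html2text(summary_html)
```

Each (summary_html, summary_text) pair, plus the URL and any other metadata, would then be inserted into the Mongo collection.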
- 27/Aug crawled 2,335 Guardian articles from my DigitalOcean server.
- 30/Aug cleansed the text with Readability and loaded the cleansed text and other metadata into Compose.io.
- 30/Aug Search API implemented and deployed.