Requirements to run this project on your system: Python 3.6+, pip, Docker, and docker-compose. A Linux distribution of your choice is recommended.
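A quick way to confirm the requirements above is to check that each tool is on your `PATH`. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper, and `python3` is an assumed binary name that may differ on your system.

```shell
# Hypothetical helper (not part of this repository):
# report whether each named tool is on PATH; return non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names come from the requirements list; python3 is an assumed binary name.
check_tools python3 docker docker-compose || echo "Install the missing tools before continuing."

# If python3 exists, also check its version and that pip is usable as a module.
if command -v python3 >/dev/null 2>&1; then
    python3 -c 'import sys; sys.exit(sys.version_info < (3, 6))' || echo "Python 3.6 or newer is required."
    python3 -m pip --version >/dev/null 2>&1 || echo "pip is not available for python3."
fi
```

pip is checked through `python3 -m pip` because the install steps invoke it the same way.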
# Clone this repository
git clone https://github.com/aakashhemadri/ir.git
cd ir
# Install pipenv (it may already be installed on your system)
# Use your appropriate python binary in place of `python3`
python3 -m pip install --user pipenv
# Install the project's pipenv environment
cd /path/to/project/root
python3 -m pipenv install
Always run the following before running the usage commands.
# Enter python environment
cd /path/to/project/root
python3 -m pipenv shell
Run the initial setup script
# Sets up the Docker environment and runs a crawl of Ars Technica
# Please inspect the script before running it
# If docker/docker-compose is set up correctly, Kibana should be up on localhost:5601
# Currently, CSV files must be imported manually through Kibana after crawling sites.
# Pre-crawled data is available under data/*; use that.
cd /path/to/project/root
sh init.sh
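Kibana can take a while to start after the containers come up, so it may help to poll `localhost:5601` before trying to import data. This is a minimal sketch that assumes `curl` is installed; `wait_for_url` is a hypothetical helper and is not part of `init.sh`.

```shell
# Hypothetical helper (not part of init.sh):
# poll a URL until it answers with a non-error HTTP status.
# Usage: wait_for_url URL [TRIES] [DELAY_SECONDS]
wait_for_url() {
    url=$1
    tries=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors (status 400 and above).
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            echo "up: $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not reachable: $url" >&2
    return 1
}
```

After `sh init.sh` finishes, `wait_for_url http://localhost:5601` retries with the defaults (30 attempts, 2 seconds apart) and returns non-zero if Kibana never responds.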
Crawling custom spiders
# Run a specific spider directly with scrapy
cd /path/to/project/root
scrapy crawl ArsTechnica -o ars-technica.new.csv
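Before importing the resulting CSV through Kibana, it can be worth checking that the crawl actually wrote something. The sketch below is an illustration only: `csv_summary` is a hypothetical helper, and its record count is line-based, so quoted fields that contain newlines will inflate it.

```shell
# Hypothetical helper (not part of this repository):
# print the header row and an approximate record count for a CSV file.
csv_summary() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "empty or missing: $file" >&2
        return 1
    fi
    echo "header: $(head -n 1 "$file")"
    # Subtract one line for the header row.
    echo "records (approx.): $(($(wc -l < "$file") - 1))"
}
```

For the crawl above, this would be run as `csv_summary ars-technica.new.csv`.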