scrapi

Getting started

To run absolutely everyting, you will need to:
- Install requirements.
- Install Elasticsearch
- Install Cassandra
- Install harvesters
- Install rabbitmq (optional)
To only run harvesters locally, you do not have to install rabbitmq

Requirements

Create and enter virtual environment for scrapi, and go to the top level project directory. From there, run

$ pip install -r requirements.txt

and the python requirements for the project will download and install.

Installing Cassandra and Elasticsearch

note: JDK 7 must be installed for Cassandra and Elasticsearch to run

Mac OSX

$ brew install cassandra
$ brew install elasticsearch

Ubuntu

Install Cassandra

Check which version of Java is installed by running the following command:
```
$ java -version
```
Use the latest version of Oracle Java 7 on all nodes.

Add the DataStax Community repository to the /etc/apt/sources.list.d/cassandra.sources.list

$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

Add the DataStax repository key to your aptitude trusted keys.

$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -

Install the package.

$ sudo apt-get update
$ sudo apt-get install dsc20=2.0.11-1 cassandra=2.0.11

Install ElasticSearch

Download and install the Public Signing Key.

$ wget -qO - https://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -

Add the ElasticSearch repository to yout /etc/apt/sources.list.

$ sudo add-apt-repository "deb http://packages.elasticsearch.org/elasticsearch/1.4/debian stable main"

Install the package

$ sudo apt-get update
$ sudo apt-get install elasticsearch

Now, just run

$ cassandra
$ elasticsearch

Or, if you'd like your cassandra session to be bound to your current session, run:

$ cassandra -f

and you should be good to go.

(Note, if you're developing locally, you do not have to run Rabbitmq!)

Rabbitmq (optional)

Mac OSX

$ brew install rabbitmq

Ubuntu

$ sudo apt-get install rabbitmq-server

Settings

You will need to have a local copy of the settings

cp scrapi/settings/local-dist.py scrapi/settings/local.py

(Note: only needed if NOT running locally!)

Running the scheduler (optional)

from the top-level project directory run:

$ invoke beat

to start the scheduler, and

$ invoke worker

to start the worker.

Harvesters

Run all harvesters with

$ invoke harvesters

or, just one with

$ invoke harvester harvester-name

Invove a harvester a certain number of days back with the --days argument. For example, to run a harvester 5 days in the past, run:

$ invoke harvester harvester-name --days=5

Testing

To run the tests for the project, just type

$ invoke test

and all of the tests in the 'tests/' directory will be run.

GrantRVD/scrapi

scrapi

Getting started

Requirements

Installing Cassandra and Elasticsearch

Mac OSX

Ubuntu

Install Cassandra

Install ElasticSearch

Rabbitmq (optional)

Mac OSX

Ubuntu

Settings

Running the scheduler (optional)

Harvesters

Testing