Scrape ODU's CPI site for projects. Based on Scrapy.
Setting up Python on Windows can be a pain. Use Cygwin or install a Linux virtual machine instead.
- Python 2.7+ or 3.4+
- Docker
- (optional) Docker Compose
Check out the repo:
git clone https://github.com/asark002/scrape-cpi.git
cd scrape-cpi
Create a virtual environment (virtualenv or venv) and install the packages for the crawler.
python3 -m venv .virt
source .virt/bin/activate
pip install -r requirements.txt
The splash service is a JavaScript rendering engine that behaves like a web browser.
Mariana's crawler component uses this service so that the HTML content can be fully rendered prior to scraping.
Docker is the easiest way to run this service:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
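Once the container is running, a quick way to confirm that Splash renders pages is to call its `render.html` HTTP endpoint directly. The sketch below is only an illustration, assuming Python 3 and Splash listening on `localhost:8050` as in the command above. The function names `build_render_url` and `fetch_rendered_html` are hypothetical and are not part of Mariana; the crawler itself may talk to Splash differently.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(target_url, wait=0.5, splash_base="http://localhost:8050"):
    """Build a Splash render.html URL for target_url.

    `wait` is the number of seconds Splash waits after page load so that
    JavaScript has time to run. The base URL is an assumption matching
    the `docker run -p 8050:8050` command.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(splash_base.rstrip("/"), query)


def fetch_rendered_html(target_url, wait=0.5, timeout=30):
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    with urlopen(build_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With the container up, `fetch_rendered_html("https://example.com")` should return the page's HTML after scripts have run. If it raises a connection error, the Splash container is probably not running or the port is not published.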
Start the REST server from the activated virtual environment:
source .virt/bin/activate
python rest_server.py
version: '3'
services:
  crawler-http-api:
    build:
      context: .
      dockerfile: ./Dockerfile
    ports:
      - "9801:9801"
    links:
      - splash
      - elasticsearch
  splash:
    image: "scrapinghub/splash"
    ports:
      - 8050:8050
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.2
    container_name: elasticsearch
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
volumes:
  esdata1:
    driver: local
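After bringing the stack up with Docker Compose, it can help to check that each published port is accepting connections. The following is a minimal sketch, assuming Python 3 and that the services are published on `localhost` with the host ports from the Compose file above (9801, 8050, 9200). The helper names `port_open` and `check_services` are hypothetical. An open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Host ports published by the Compose file above.
SERVICES = {"crawler-http-api": 9801, "splash": 8050, "elasticsearch": 9200}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_services(host="localhost"):
    """Map each service name to whether its published port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` returns a dictionary keyed by service name. Elasticsearch can take a while to start, so a `False` for it shortly after startup is not necessarily an error.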