A Storm-based web crawler with a Cassandra backend

Storm Scraper

TL;DR: Storm Scraper is an example Storm program. Please do not treat it as production-ready.

Storm Scraper is a simple Storm topology that lets you crawl a website n levels deep. It reads the list of sites to scrape from Cassandra and stores each page's HTML, incoming links, outgoing links, and text.
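To make "n levels deep" concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the topology's actual code: the class `DepthLimitedCrawl`, the method `crawl`, and the in-memory `links` map (a stand-in for fetching a page and extracting its outgoing links) are all hypothetical names chosen for illustration, and no Storm or Cassandra APIs are involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DepthLimitedCrawl {

    /**
     * Breadth-first walk over a link graph that stops maxDepth levels from the seed.
     * Returns each reached URL mapped to the depth at which it was first seen.
     * ASSUMPTION: the links map replaces real page fetching and link extraction.
     */
    static Map<String, Integer> crawl(String seed, int maxDepth,
                                      Map<String, List<String>> links) {
        Map<String, Integer> depthByUrl = new LinkedHashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        depthByUrl.put(seed, 0);
        frontier.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            int depth = depthByUrl.get(url);
            if (depth >= maxDepth) {
                continue; // at the depth limit: record the page, do not follow its links
            }
            for (String out : links.getOrDefault(url, List.of())) {
                if (!depthByUrl.containsKey(out)) { // skip pages already seen
                    depthByUrl.put(out, depth + 1);
                    frontier.add(out);
                }
            }
        }
        return depthByUrl;
    }

    public static void main(String[] args) {
        // Hypothetical link graph: a -> b, c; b -> d; d -> e
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "d", List.of("e"));
        // With a limit of 2, "e" (3 levels from "a") is never reached.
        System.out.println(crawl("a", 2, links)); // prints {a=0, b=1, c=1, d=2}
    }
}
```

Tracking the depth alongside each URL is what lets a crawler decide whether to follow a page's outgoing links; in a distributed topology the same idea would typically travel with each message rather than live in one in-memory map.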

I've only tested this locally.

Settings are in src/main/resources/scraper.properties

To Run:

  • Run Cassandra

  • Create the schema

        cqlsh < stormscraper.cql

  • Run the Storm topology locally

        MAVEN_OPTS=-Xmx1g mvn compile exec:java

Brought to you by @tjake