
A lightweight Scala web crawler and news classifier

Primary language: Scala. License: GNU General Public License v3.0 (GPL-3.0).

Blackspider

A lightweight crawler and news classifier

Blackspider components

### Crawler

Gets links (nodes), builds edges between them, and downloads web documents.
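As a rough illustration of the node/edge idea, here is a minimal breadth-first crawl sketch. It is not Blackspider's actual code: `CrawlSketch`, `crawl`, and the `fetch` function (which stands in for downloading a page and extracting its links) are hypothetical names, and the "web" is simulated with an in-memory map.

```scala
import scala.collection.mutable

object CrawlSketch {
  // Hypothetical fetcher: given a URL, returns the links found on that page.
  type Fetch = String => Seq[String]

  /** Breadth-first crawl from `seed`; returns visited nodes and directed edges. */
  def crawl(seed: String, fetch: Fetch, maxPages: Int): (Set[String], Set[(String, String)]) = {
    val visited = mutable.LinkedHashSet[String]()
    val edges = mutable.Set[(String, String)]()
    val queue = mutable.Queue(seed)
    while (queue.nonEmpty && visited.size < maxPages) {
      val url = queue.dequeue()
      if (!visited.contains(url)) {
        visited.add(url)
        for (target <- fetch(url)) {
          edges.add((url, target))                 // record the edge url -> target
          if (!visited.contains(target)) queue.enqueue(target)
        }
      }
    }
    (visited.toSet, edges.toSet)
  }
}
```

A real crawler would replace `fetch` with an HTTP download plus HTML link extraction, and would typically add politeness delays and URL normalization.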

### Indexer

Indexes web documents to speed up search queries.
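One common way to speed up queries is an inverted index that maps each term to the documents containing it. The sketch below assumes that approach; the README does not say which index structure Blackspider uses, and `IndexSketch`, `build`, and `search` are illustrative names.

```scala
object IndexSketch {
  // Simplifying assumption: lowercase and split on non-alphanumeric characters.
  def tokens(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  /** Builds term -> set of document ids. */
  def build(docs: Map[Int, String]): Map[String, Set[Int]] =
    docs.toSeq
      .flatMap { case (id, text) => tokens(text).distinct.map(term => term -> id) }
      .groupBy(_._1)
      .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

  /** Returns ids of documents containing every query term (AND semantics). */
  def search(index: Map[String, Set[Int]], query: String): Set[Int] = {
    val postings = tokens(query).map(t => index.getOrElse(t, Set.empty[Int]))
    if (postings.isEmpty) Set.empty[Int] else postings.reduce(_ intersect _)
  }
}
```

With the index built once, a query only touches the posting sets of its terms instead of scanning every document.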

### Ranker

Ranks documents using the PageRank algorithm.
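For reference, a minimal power-iteration sketch of PageRank over a link map follows. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions of this sketch, not details taken from Blackspider.

```scala
import scala.collection.mutable

object PageRankSketch {
  /** `links` maps each page to the pages it links to. */
  def pageRank(links: Map[String, Seq[String]],
               damping: Double = 0.85,
               iterations: Int = 50): Map[String, Double] = {
    val nodes = (links.keySet ++ links.values.flatten).toSeq
    val n = nodes.size
    var ranks = nodes.map(u => u -> 1.0 / n).toMap   // start from a uniform distribution
    for (_ <- 1 to iterations) {
      // Rank held by pages with no outgoing links is shared by all pages.
      val danglingMass = nodes.filter(u => links.getOrElse(u, Nil).isEmpty).map(ranks).sum
      val base = (1 - damping) / n + damping * danglingMass / n
      val incoming = mutable.Map[String, Double]().withDefaultValue(0.0)
      for ((src, targets) <- links if targets.nonEmpty; t <- targets)
        incoming(t) += ranks(src) / targets.size     // each link passes on an equal share
      ranks = nodes.map(u => u -> (base + damping * incoming(u))).toMap
    }
    ranks
  }
}
```

The ranks always sum to 1, so they can be read as the probability that a random surfer is on each page.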

### News Monitor

Monitors news sources and updates the latest news, either by re-crawling or through RSS.

### Tokenizer

Extracts features (tokens) from web documents for classification.

### Classifier

Classifies newly crawled web pages using the Naïve Bayes algorithm.
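To make the classification step concrete, here is a small multinomial Naïve Bayes sketch with Laplace (add-one) smoothing. The choice of the multinomial variant, the tokenization rule, and all names (`NaiveBayesSketch`, `Model`, `train`, `classify`) are assumptions for illustration rather than Blackspider's actual implementation.

```scala
object NaiveBayesSketch {
  // Simplifying assumption: lowercase and split on non-alphanumeric characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSeq

  final case class Model(
    docCounts: Map[String, Int],                // label -> number of training documents
    wordCounts: Map[String, Map[String, Int]],  // label -> word -> occurrences
    vocabulary: Set[String]
  )

  /** Trains on (label, text) pairs. */
  def train(samples: Seq[(String, String)]): Model = {
    val byLabel = samples.groupBy(_._1)
    val docCounts = byLabel.map { case (label, docs) => label -> docs.size }
    val wordCounts = byLabel.map { case (label, docs) =>
      label -> docs.flatMap(d => tokenize(d._2)).groupBy(identity).map { case (w, ws) => w -> ws.size }
    }
    Model(docCounts, wordCounts, wordCounts.values.flatMap(_.keys).toSet)
  }

  /** Returns the label with the highest log posterior (requires a non-empty model). */
  def classify(model: Model, text: String): String = {
    val totalDocs = model.docCounts.values.sum.toDouble
    val words = tokenize(text).filter(model.vocabulary)   // ignore unseen words
    model.docCounts.keys.maxBy { label =>
      val counts = model.wordCounts(label)
      val totalWords = counts.values.sum.toDouble
      val logPrior = math.log(model.docCounts(label) / totalDocs)
      // Add-one smoothing keeps words unseen in this label from zeroing the score.
      val logLikelihood = words.map { w =>
        math.log((counts.getOrElse(w, 0) + 1.0) / (totalWords + model.vocabulary.size))
      }.sum
      logPrior + logLikelihood
    }
  }
}
```

Working in log space avoids numeric underflow when a page contains many tokens.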

Blackspider architecture

Overall Architecture