Pinned Repositories
ansible-storm
Ansible playbook for deploying a Storm cluster
behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
behemoth-commoncrawl
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
ngrams-api
Java API for querying an N-grams corpus. Uses Lucene for indexing and searching data from the Google Web-1T format
NutchFight
Resources for comparing Apache Nutch 1.8 and 2.x
stormcrawler-docker
Resources for running StormCrawler with Docker services
stormcrawlerfight
Crawl configurations for benchmarking / testing StormCrawler
TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
textclassification-examples
Use cases for DigitalPebble's TextClassification API
TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API
DigitalPebble Ltd's Repositories
DigitalPebble/behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
DigitalPebble/TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
DigitalPebble/stormcrawler-docker
Resources for running StormCrawler with Docker services
DigitalPebble/stormcrawlerfight
Crawl configurations for benchmarking / testing StormCrawler
DigitalPebble/textclassification-examples
Use cases for DigitalPebble's TextClassification API
DigitalPebble/ansible-storm
Ansible playbook for deploying a Storm cluster
DigitalPebble/TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API
DigitalPebble/behemoth-commoncrawl
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
DigitalPebble/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
DigitalPebble/ngrams-api
Java API for querying an N-grams corpus. Uses Lucene for indexing and searching data from the Google Web-1T format
DigitalPebble/NutchFight
Resources for comparing Apache Nutch 1.8 and 2.x
DigitalPebble/tescobank
Setup for crawling tescobank with StormCrawler
DigitalPebble/sc-warc
WARC resources for StormCrawler
DigitalPebble/behemoth-elasticsearch
Elasticsearch module for Behemoth
DigitalPebble/behemoth-textclassification
Module for classifying Behemoth documents with a model from our Text Classification API
DigitalPebble/crawlurlfrontier
Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
DigitalPebble/nutch
Apache Nutch is an extensible and scalable web crawler
DigitalPebble/urlfrontier-client
URLFrontier client written in Rust (mostly as a way of learning Rust)
DigitalPebble/benchmark
StormCrawler topology to evaluate the performance of different backends and configurations
DigitalPebble/crawler4j-frontier-battle
DigitalPebble/digitalpebble.github.io
Resources for the DigitalPebble website
DigitalPebble/docs
Documentation for Docker Official Images in docker-library
DigitalPebble/storm
Mirror of Apache Storm
DigitalPebble/tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
DigitalPebble/tika-cc
Resources for generating a corpus of documents from CommonCrawl (CC) for Tika
DigitalPebble/tika-detector-stormcrawler
Wraps the charset detection logic from StormCrawler as a Tika module