DigitalPebble Ltd

Bristol, UK

Pinned Repositories

ansible-storm
Ansible playbook for deploying a Storm cluster
7 5 11
behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
Language: Java
281 44 4260
behemoth-commoncrawl
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
Language: Java
4 6 00
ngrams-api
Java API for querying an N-grams corpus. Uses Lucene to index and search data in the Google Web-1T format
Language: Java
4 2 02
NutchFight
Resources for comparing versions 1.8 and 2.x of Apache Nutch
Language: Java
4 4 00
stormcrawler-docker
Resources for running StormCrawler with Docker services
Language: Dockerfile
10 4 13
stormcrawlerfight
Crawl configurations for benchmarking / testing StormCrawler
Language: Shell
10 3 05
TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
Language: Java
48 16 121
textclassification-examples
Use cases for DigitalPebble's TextClassification API
Language: Java
10 2 03
TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API
Language: Java
5 9 13

DigitalPebble Ltd's Repositories

DigitalPebble/behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
Language: Java
281 44 4260
DigitalPebble/TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
Language: Java
48 16 121
DigitalPebble/stormcrawler-docker
Resources for running StormCrawler with Docker services
Language: Dockerfile
10 4 13
DigitalPebble/stormcrawlerfight
Crawl configurations for benchmarking / testing StormCrawler
Language: Shell
10 3 05
DigitalPebble/textclassification-examples
Use cases for DigitalPebble's TextClassification API
Language: Java
10 2 03
DigitalPebble/ansible-storm
Ansible playbook for deploying a Storm cluster
7 5 11
DigitalPebble/TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API
Language: Java
5 9 13
DigitalPebble/behemoth-commoncrawl
Support for the old (pre-2013) CommonCrawl dataset in Behemoth
Language: Java
4 6 00
DigitalPebble/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
Language: Java
4 6 01
DigitalPebble/ngrams-api
Java API for querying an N-grams corpus. Uses Lucene to index and search data in the Google Web-1T format
Language: Java
4 2 02
DigitalPebble/NutchFight
Resources for comparing versions 1.8 and 2.x of Apache Nutch
Language: Java
4 4 00
DigitalPebble/tescobank
Setup for crawling tescobank with StormCrawler (SC)
Language: Java
4 4 02
DigitalPebble/sc-warc
WARC resources for StormCrawler
2 4 111
DigitalPebble/behemoth-elasticsearch
Elasticsearch module for Behemoth
Language: Java
1 5 0
DigitalPebble/behemoth-textclassification
Module for classifying Behemoth documents with a model from our Text Classification API
Language: Java
1 2 0
DigitalPebble/crawlurlfrontier
Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
Language: FLUX
1 3 0
DigitalPebble/nutch
Apache Nutch is an extensible and scalable web crawler
Language: Java
1 1 0
DigitalPebble/urlfrontier-client
URLFrontier client written in Rust (mostly as a way of learning Rust)
Language: Rust
1 2 0
DigitalPebble/benchmark
StormCrawler topology to evaluate the performance of different backends and configurations
Language: Shell
2 1
DigitalPebble/crawler4j-frontier-battle
Language: Java
2 0
DigitalPebble/digitalpebble.github.io
Resources for the DigitalPebble website
Language: SCSS
3 0
DigitalPebble/docs
Documentation for Docker Official Images in docker-library
Language: Shell
1 0
DigitalPebble/storm
Mirror of Apache Storm
Language: Java
3 0
DigitalPebble/tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Language: Java
1 0
DigitalPebble/tika-cc
Resources for generating a corpus of documents from CommonCrawl (CC) for Tika
Language: Shell
3 0
DigitalPebble/tika-detector-stormcrawler
Wraps the charset detection logic from StormCrawler as a Tika module
Language: Java
1
