Pinned Repositories
cdx-index-client
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
CI-HiBench
HiBench, a big data benchmark from Intel
common_crawl_index
Index URLs in Common Crawl
commoncrawl-examples
A library of examples showing how to use the Common Crawl corpus.
CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents of certain file types or MIME types, for mass-testing frameworks such as Apache POI and Apache Tika
dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
HiBench
HiBench is a big data benchmark suite.
KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
spark-Jupyter-AWS
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
CI-Research's Repositories
CI-Research/KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
CI-Research/cdx-index-client
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
CI-Research/spark-Jupyter-AWS
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
CI-Research/CI-HiBench
HiBench, a big data benchmark from Intel
CI-Research/common_crawl_index
Index URLs in Common Crawl
CI-Research/commoncrawl-examples
A library of examples showing how to use the Common Crawl corpus.
CI-Research/CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents of certain file types or MIME types, for mass-testing frameworks such as Apache POI and Apache Tika
CI-Research/dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
CI-Research/HiBench
HiBench is a big data benchmark suite.