common-crawl
There are 40 repositories under the common-crawl topic.
ashvardanian/StringZilla
Up to 100x faster strings for C, C++, CUDA, Python, Rust, Swift, JS, & Go, leveraging NEON, AVX2, AVX-512, SVE, GPGPU, & SWAR to accelerate search, hashing, sorting, edit distances, sketches, and memory ops 🦖
commoncrawl/cc-pyspark
Process Common Crawl data with Python and Spark
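As a rough, hedged illustration of that workflow (not cc-pyspark's actual code), the sketch below distributes a list of WARC URLs with Spark and counts HTML response records with warcio; the WARC path is a placeholder.

```python
# A minimal sketch of the Spark-over-WARC pattern; illustrative only.
import requests
from pyspark import SparkContext
from warcio.archiveiterator import ArchiveIterator

def count_html_records(warc_url):
    """Stream one WARC file over HTTP and count text/html responses."""
    resp = requests.get(warc_url, stream=True)
    count = 0
    for record in ArchiveIterator(resp.raw):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type", "")
        if ctype.startswith("text/html"):
            count += 1
    return count

if __name__ == "__main__":
    sc = SparkContext(appName="warc-count-sketch")
    # In practice the URL list comes from the crawl's warc.paths.gz listing.
    warc_urls = ["https://data.commoncrawl.org/crawl-data/.../example.warc.gz"]
    print(sc.parallelize(warc_urls).map(count_html_records).sum())
```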
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
michaelharms/comcrawl
A Python utility for downloading Common Crawl data
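A minimal sketch following the library's documented usage pattern; the crawl label and URL filter here are illustrative:

```python
from comcrawl import IndexClient

client = IndexClient(["2019-51"])                # restrict to one monthly index
client.search("reddit.com/r/MachineLearning/*")  # query the URL index
client.download()                                # fetch the matching pages
first_page_html = client.results[0]["html"]
```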
commoncrawl/cc-crawl-statistics
Statistics of Common Crawl monthly archives mined from URL index files
oscar-project/ungoliant
🕷 The pipeline for the OSCAR corpus
crissyfield/troll-a
Drill into WARC web archives
commoncrawl/cc-webgraph
Tools to construct and process Common Crawl webgraphs
oscar-project/goclassy
An asynchronous, concurrent pipeline for classifying Common Crawl data, based on fastText's pipeline.
commoncrawl/cc-notebooks
Various Jupyter notebooks about Common Crawl data
IBM/cc-dbp
A dataset for knowledge base population research using Common Crawl and DBpedia.
bminixhofer/gerpt2
Small and large German versions of GPT-2.
cisnlp/GlotCC
🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024
oscar-project/oscar-website
The website of the OSCAR Project
toimik/CommonCrawl
Processing tools for Common Crawl data
Mgosi/Big-Data-Analysis-using-MapReduce-in-Hadoop
We explore data using big data analysis and visualization techniques in three main steps: (i) aggregating data from different sources, (ii) analyzing it with MapReduce, and (iii) visualizing the results in Tableau. Data analysis is critical for understanding a dataset and what can be done with it; small datasets are easy to process directly, but at larger scales identifying trends requires distributed processing, which is why we turn to big data analysis. In this lab we collect close to 20,000 tweets, 500 New York Times articles, and 500 Common Crawl articles on Entertainment, our main topic of discussion. After preprocessing, we feed the data to MapReduce jobs that compute word counts and word co-occurrences, and use those results to identify trends in the collected data. The analysis is implemented in Python.
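As a rough illustration of the MapReduce step described above, here is a generic Hadoop Streaming word-count pair in Python; it is a sketch of the technique, not the project's actual code.

```python
# Generic Hadoop Streaming word count: mapper emits (word, 1), reducer sums.
import sys
from itertools import groupby

def mapper():
    # Emit "word<TAB>1" for every token on stdin.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive
    # consecutively and can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as "python wordcount.py map" or "python wordcount.py reduce",
    # e.g. via hadoop-streaming's -mapper and -reducer options.
    {"map": mapper, "reduce": reducer}[sys.argv[1]]()
```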
alumik/common-crawl-downloader
Distributed download scripts for Common Crawl data
tokenmill/common-crawl-utils
Various Common Crawl utilities in Clojure.
code402/warc-benchmark
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
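For flavor, a hedged Python version of the grep-a-WARC pattern such benchmarks compare, using warcio (an assumption; the repo's own samples may differ):

```python
# Stream WARC records and report the URLs of responses containing a term.
import sys
from warcio.archiveiterator import ArchiveIterator

def grep_warc(path, needle):
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if needle in record.content_stream().read():
                print(record.rec_headers.get_header("WARC-Target-URI"))

if __name__ == "__main__":
    # Usage: python grep_warc.py file.warc.gz searchterm
    grep_warc(sys.argv[1], sys.argv[2].encode())
```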
HRN-Projects/common_crawl_with_scrapy
Parsing huge web archive (WARC) files located via the Common Crawl data index to fetch a given domain's data concurrently, using Python and Scrapy.
hrbrmstr/cc
⛏ Extract metadata for a specific target from commoncrawl.org query results
ilyankou/cc-gpx
CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl
connor-marchand/gau-python
This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corben Leo's gau
thunderpoot/cc-getpage
Lightweight Python utility for retrieving individual pages from the Common Crawl archives.
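The general technique behind retrieving a single page is an index lookup followed by an HTTP Range request for one gzip member of a WARC file. A hedged sketch, assuming the standard index fields (filename, offset, length) and an illustrative crawl label; this is not necessarily cc-getpage's own code:

```python
# Look up one capture in the Common Crawl index, then range-fetch its record.
import gzip
import io
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def fetch_record(url):
    # 1. Find the capture's WARC filename, byte offset, and record length.
    r = requests.get(INDEX, params={"url": url, "output": "json", "limit": "1"})
    r.raise_for_status()
    rec = json.loads(r.text.splitlines()[0])
    offset, length = int(rec["offset"]), int(rec["length"])
    # 2. Download just that gzip member of the WARC file.
    warc = requests.get(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    # 3. Decompress: the result is the WARC record (headers + HTTP response).
    return gzip.GzipFile(fileobj=io.BytesIO(warc.content)).read()

print(fetch_record("commoncrawl.org")[:300])
```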
mwoss/mors
Application of topic models for information retrieval and search engine optimization.
neil-zt/common-crawl-client
A Common Crawl client example for scraping specific websites.
socket-var/nyt-twitter-cc-hadoop
Perform big data analysis on New York Times, Twitter, and Common Crawl APIs
bottomless-archive-project/common-crawl-client
This library is a very lightweight client for Common Crawl's WARC files.
fizerkhan/cdx-index-client
A command-line tool for using the Common Crawl Index API at http://index.commoncrawl.org/
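The underlying API can also be queried directly; for example, a capture listing for a domain looks like this (crawl label and domain illustrative):

```python
# List captures for a URL pattern from the Common Crawl Index API.
import json
import requests

resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2023-50-index",
    params={"url": "example.com/*", "output": "json"},
)
for line in resp.text.splitlines():
    capture = json.loads(line)  # one JSON object per capture
    print(capture["timestamp"], capture["status"], capture["url"])
```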
fizerkhan/CommonCrawlDocumentDownload
A small tool that uses the Common Crawl URL Index to download documents with certain file types or MIME types for mass-testing of frameworks like Apache POI and Apache Tika
fizerkhan/KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set to find industry trends
bottomless-archive-project/url-collector
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
Dahouabdelhalim/Discourse-marksers-and-Web-crawling
Identification of discourse markers in French
hadrianw/abracabra
Eventually a search engine; currently a filtering pipeline for HTML and, soon, WARC files.
skyler-myers-db/Common-Crawl-Analysis
Parsing the Common Crawl dataset using Scala and Spark
srmocher/fake-science
Analyzing Common Crawl data to classify content as fake or real science using trained deep learning models (LSTM, CNN)