common-crawl

There are 40 repositories under the common-crawl topic.

  • ashvardanian/StringZilla

    Up to 100x faster strings for C, C++, CUDA, Python, Rust, Swift, JS, & Go, leveraging NEON, AVX2, AVX-512, SVE, GPGPU, & SWAR to accelerate search, hashing, sorting, edit distances, sketches, and memory ops 🦖

    Language:C2.7k2510389
  • commoncrawl/cc-pyspark

    Process Common Crawl data with Python and Spark

    Language:Python440192689
  • commoncrawl/news-crawl

    News crawling with StormCrawler - stores content as WARC

    Language:Java355325637
  • michaelharms/comcrawl

    A python utility for downloading Common Crawl data

    Language:Python2376942
  • commoncrawl/cc-crawl-statistics

    Statistics of Common Crawl monthly archives mined from URL index files

    Language:Python17717811
  • oscar-project/ungoliant

    🕷 The pipeline for the OSCAR corpus

    Language:Rust17124316
  • crissyfield/troll-a

    Drill into WARC web archives

    Language:Go1366111
  • commoncrawl/cc-webgraph

    Tools to construct and process Common Crawl webgraphs

    Language:Java8811145
  • oscar-project/goclassy

    An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

    Language:Go86926
  • commoncrawl/cc-notebooks

    Various Jupyter notebooks about Common Crawl data

    Language:Jupyter Notebook511729
  • IBM/cc-dbp

    A dataset for knowledge base population research using Common Crawl and DBpedia.

    Language:Java289219
  • bminixhofer/gerpt2

    German small and large versions of GPT2.

    Language:Python20100
  • cisnlp/GlotCC

    🕸 GlotCC Dataset and Pipeline -- NeurIPS 2024

    Language:Jupyter Notebook20900
  • oscar-project/oscar-website

    The website of the Oscar Project

    Language:TeX1141014
  • toimik/CommonCrawl

    Common Crawl's processing tools

    Language:C#11200
  • Mgosi/Big-Data-Analysis-using-MapReduce-in-Hadoop

    We explore data using Big Data analysis and visualization skills. To do this, we perform three main operations: (i) data aggregation from different sources, (ii) Big Data analysis using MapReduce, and (iii) visualization with Tableau. Data analysis is critical to understanding the data and what we can do with it. Small datasets are easy to process, but big companies need to identify trends in order to decide what changes to make, and Big Data analysis addresses this problem. In this lab, we collect close to 20,000 tweets, 500 New York Times articles, and 500 Common Crawl articles about entertainment, our main topic. We preprocess this data and feed it to a MapReduce job to compute word count and word co-occurrence, which we use to find trends in the collected data. We used Python for the data analysis.

    Language:Jupyter Notebook8213
  • alumik/common-crawl-downloader

    Distributed download scripts for Common Crawl data

    Language:Python7010
  • tokenmill/common-crawl-utils

    Various Common Crawl utilities in Clojure.

    Language:Clojure7241
  • code402/warc-benchmark

    Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.

    Language:Shell6207
  • HRN-Projects/common_crawl_with_scrapy

    Parsing Huge Web Archive files from Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.

    Language:Python6105
  • hrbrmstr/cc

    ⛏ Extract metadata of a specific target based on the results of "commoncrawl.org"

    Language:R540
  • ilyankou/cc-gpx

    CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl

    Language:Jupyter Notebook5101
  • connor-marchand/gau-python

    This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corbin Leo's gau

    Language:Python3100
  • thunderpoot/cc-getpage

    Lightweight Python utility for retrieving individual pages from the Common Crawl archives.

    Language:Python320
  • mwoss/mors

    Application of topic models for information retrieval and search engine optimization.

    Language:Python2000
  • neil-zt/common-crawl-client

    A Common Crawl client example for scraping specific websites.

    Language:Jupyter Notebook2100
  • socket-var/nyt-twitter-cc-hadoop

    Perform big data analysis on New York Times, Twitter and Common Crawl APIs

    Language:Jupyter Notebook2100
  • bottomless-archive-project/common-crawl-client

    This library is a very lightweight client to Common Crawl's WARC files.

    Language:Java1100
  • fizerkhan/cdx-index-client

    A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/

    Language:Python120
  • fizerkhan/CommonCrawlDocumentDownload

    A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types for mass-testing of frameworks like Apache POI and Apache Tika

    Language:Java130
  • fizerkhan/KeywordAnalysis

    Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends

    Language:Python120
  • bottomless-archive-project/url-collector

    An application that crawls the Common Crawl corpus for URLs with the specified file extensions.

    Language:Java0150
  • Dahouabdelhalim/Discourse-marksers-and-Web-crawling

    Discourse Markers identification in French Language

    Language:HTML0100
  • hadrianw/abracabra

    Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

    Language:Rust0200
  • skyler-myers-db/Common-Crawl-Analysis

    Parsing the Common Crawl database using Scala and Spark

    Language:Scala10
  • srmocher/fake-science

    Analyzing Common Crawl data (specifically) to classify fake/real based on trained deep learning models (LSTM, CNN)

    Language:Python50