web-data-extraction

There are 29 repositories under the web-data-extraction topic.

  • firecrawl/firecrawl

    🔥 The Web Data API for AI - Turn entire websites into LLM-ready markdown or structured data

    Language:TypeScript
  • MohamedHmini/iww

    AI-based web wrapper for web content extraction

    Language:Python
  • neurons-me/this.url

    The this.url class is designed to fetch and parse URL data, returning an object with structured information that can then be used for machine learning algorithms in a database or other storage.

    Language:JavaScript
  • lightfeed/extractor

    Using LLMs and AI browser automation to robustly extract web data

    Language:TypeScript
  • luminati-io/java-web-scraping

    A quick guide, with a code example, on how to use Java for web scraping

  • dstark5/gnews-scraper

    GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making it convenient to access and use the scraped information.

    Language:TypeScript
  • DemonMartin/scrappey-wrapper

    An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)

    Language:JavaScript
  • jjonescz/awe

    AI-based web extractor

    Language:Python
  • Boomslet/Web_Crawler

    Open-source web crawler

    Language:Python
  • kaizenplatform/FacebookInsightsConnector

    The Tableau Web Data Connector for Facebook Insights API

    Language:JavaScript
  • SaurabhSSB/BookMiner

    A pipeline to scrape, extract, and analyze book data from web pages and turn it into insights.

    Language:HTML
  • wbsg-uni-mannheim/WDCFramework

    Java framework used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.

    Language:Java
  • lekhmanrus/real-shot-pdf

    RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.

    Language:TypeScript
  • lightfeed/sdk

    Lightfeed SDK to search and filter web data

    Language:Python
  • oxpath/oxpath

    OXPath from Oxford

    Language:Java
  • wbsg-uni-mannheim/schemaorg-tables

    This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.

    Language:Python
  • hoxhaeris/get_muitiple

    Get and process multiple resources from the web, using asyncio (aiohttp) to fetch the data and multiprocessing/multithreading to process it.

    Language:Python
  • ranajahanzaib/wdx

    A web data extraction library written in golang.

    Language:Go
  • wbsg-uni-mannheim/wdc-page

    This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl.

    Language:HTML
  • gonzalopezgil/scraping-interface

    Python-based desktop app for effortless web scraping

    Language:Python
  • mibrahimbashir/customer_reviews

    A Comprehensive Script To Extract Customer Reviews For Machine Learning

    Language:Python
  • sc10ntech/extract-site-metadata

    Metadata extractor for the sprawling web ⚙️

    Language:TypeScript
  • BigDataIA-Spring2025-4/Web-and-PDF-Data-Extraction-Tool

    A Streamlit-based app with a FastAPI backend for extracting structured data (text, images, tables) from websites and PDFs. Processed data is stored in AWS S3 and rendered in a standardized markdown format. The APIs are deployed as a Google Cloud Run service.

    Language:Jupyter Notebook
  • chelvanai/Web-data-scrap

    Web data scraping with Scrapy

    Language:Python
  • dariga-sm/Word-Frequency-in-Moby-Dick

    Scrapes the novel Moby Dick from the Project Gutenberg website using the Python package requests, extracts words from this web data using BeautifulSoup, and then analyzes the distribution of words using the Natural Language Toolkit (nltk).

    Language:HTML
  • wbsg-uni-mannheim/StructuredDataProfiler

    Java project for profiling the results of the yearly Web Data Commons extraction of structured data with RDFa, Microdata, Microformats, and embedded JSON-LD annotations.

    Language:Java
  • wyattowalsh/proxywhirl

    rotating proxy system

    Language:Python
  • yumeangelica/store_data_extractor

    A Python-based web data extractor designed to monitor online stores and track product updates in real-time. This project is developed as a standalone module but is also part of the larger jirai_sweeties project, where it integrates with additional features.

    Language:Python
  • yumeangelica/user_agent_extractor

    This is a comprehensive web data extractor tool that extracts user agent strings from a user agent database website. The extractor is designed for high-volume data collection with enterprise-level capabilities.

    Language:Python