web-data-extraction

There are 23 repositories under the web-data-extraction topic.

MohamedHmini/iww
AI-based web wrapper for web content extraction
Language: Python 100 7 314
neurons-me/this.url
The this.url class is designed to fetch and parse URL data, returning an object with structured information that can be kept in a database or other storage and used by machine learning algorithms.
Language: JavaScript 60 0 0
luminati-io/java-web-scraping
Quick guide, with a code example, on how to use Java for web scraping
16 0 04
DemonMartin/scrappey-wrapper
An API wrapper for Scrappey.com written in Node.js (Cloudflare bypass & solver)
Language: JavaScript 13 1 24
jjonescz/awe
AI-based web extractor
Language: Python 11 2 02
dstark5/gnews-scraper
GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, making it convenient to access and use the scraped information
Language: TypeScript 10 2 33
Boomslet/Web_Crawler
Open-source web crawler
Language: Python 9 3 06
kaizenplatform/FacebookInsightsConnector
The Tableau Web Data Connector for Facebook Insights API
Language: JavaScript 8 25 14
wbsg-uni-mannheim/WDCFramework
Java Framework which is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.
Language: Java 8 1 01
lekhmanrus/real-shot-pdf
RealShotPDF is a Chrome extension designed to simplify the process of creating PDF documents from web content. The extension allows users to navigate through selected webpages, parse and display links in a tree view, and generate PDFs for the chosen pages. It operates locally without sending any data to external servers.
Language: TypeScript 6 1 11
oxpath/oxpath
OXPath from Oxford
Language: Java 4 2 01
wbsg-uni-mannheim/schemaorg-tables
This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.
Language: Python 3 1 02
hoxhaeris/get_muitiple
Get and process multiple resources from the web, using asyncio (aiohttp) to fetch the data and multiprocessing/multithreading to process it.
Language: Python 2 1 00
ranajahanzaib/wdx
A web data extraction library written in Go.
Language: Go 2 1 0
wbsg-uni-mannheim/wdc-page
This repository contains the source files of the Web Data Commons website and is used to maintain the site. The Web Data Commons project extracts structured data from the Common Crawl
Language: HTML 1 2 01
BigDataIA-Spring2025-4/DAMG7245_Assignment01
A Streamlit-based app with a FastAPI backend for extracting structured data (text, images, tables) from websites and PDFs. Processed data is stored in AWS S3 and rendered in a markdown-standardized format. APIs are deployed on Google Cloud Run Service
Language: Jupyter Notebook 0 1 70
gonzalopezgil/scraping-interface
Python-based desktop app for effortless web scraping
Language: Python 0 1 00
mibrahimbashir/customer_reviews
A Comprehensive Script To Extract Customer Reviews For Machine Learning
Language: Python 00
sc10ntech/extract-site-metadata
Metadata extractor for the sprawling web ⚙️
Language: TypeScript 0 1 41
chelvanai/Web-data-scrap
Web data scraping with Scrapy
Language: Python 2 0
dariga-sm/Word-Frequency-in-Moby-Dick
Scrape the novel Moby Dick from the Project Gutenberg website using the Python package requests, extract words from this web data using BeautifulSoup, and analyze the distribution of words using the Natural Language Toolkit (NLTK).
Language: HTML 1 0
wbsg-uni-mannheim/StructuredDataProfiler
Java project for profiling the results of the yearly Web Data Commons extraction of structured data with RDFa, Microdata, Microformat, and Embedded JSON-LD annotations.
Language: Java 1 0
yumeangelica/store_data_extractor
A Python-based web data extractor designed to monitor online stores and track product updates in real-time. This project is developed as a standalone module but is also part of the larger jirai_sweeties project, where it integrates with additional features.
Language: Python
