extract-data

There are 242 repositories under the extract-data topic.

  • opendatalab/MinerU

    A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.

    Language:Python24.3k1268541.8k
  • bda-research/node-crawler

    Web Crawler/Spider for NodeJS + server-side jQuery ;-)

    Language:TypeScript6.7k255306878
  • pymupdf/PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

    Language:Python6.2k652.1k555
  • meltano/meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

    Language:Python1.9k146.7k167
  • markummitchell/engauge-digitizer

    Extracts data points from images of graphs

    Language:C++1.3k69382215
  • elixir-crawly/crawly

    Crawly, a high-level web crawling & scraping framework for Elixir.

    Language:Elixir1k20108116
  • slotix/dataflowkit

    Extract structured data from websites. Website scraping.

    Language:Go671241380
  • danschultzer/receipt-scanner

    Receipt scanner extracts information from your PDF or image receipts - built in NodeJS

    Language:JavaScript298171856
  • OmkarPathak/ResumeParser

    A simple resume parser used for extracting information from resumes

    Language:Python2921634167
  • Qusic/TraceUtility

    Extract data from .trace documents generated by Instruments

    Language:Objective-C224314181
  • jpjacobpadilla/Stealth-Requests

    Undetected Web-Scraping & Seamless HTML Parsing in Python!

    Language:Python185419
  • yuanxu-li/html-table-extractor

    Extract data from HTML tables

    Language:Python8531622
  • ropensci/smapr

    An R package for acquisition and processing of NASA SMAP data

    Language:R82134325
  • CairX/extract-colors-py

    Extract colors from an image. Colors are grouped based on visual similarities using the CIE76 formula.

    Language:Python6822720
  • msoap/html2data

    Library and cli for extracting data from HTML via CSS selectors

    Language:Go68313
  • isaacmg/fb_scraper

    FBLYZE is a Facebook scraping system and analysis system.

    Language:Jupyter Notebook6482321
  • Techcatchers/PyLyrics-Extractor

    Get lyrics for any song by just passing in the song name (spelled or misspelled) in less than 2 seconds using this awesome Python library.

    Language:Python5531118
  • fivesmallq/web-data-extractor

    Extracting and parsing structured data with jQuery Selector, XPath or JsonPath from common web formats like HTML, XML and JSON.

    Language:Java5462019
  • asad70/Insider-Trading

    This program extracts insider trading data from the SEC website and stores it in an Excel file for the specified time frame.

    Language:Python522215
  • osh/gr-eventstream

    gr-eventstream is a set of GNU Radio blocks for creating precisely timed events and either inserting them into, or extracting them from, normal data streams. It allows for the definition of high-speed, time-synchronous C++ burst event handlers, as well as easy bridging to standard GNU Radio Async PDU messages with precise timing.

    Language:C++44111028
  • labteral/bluebird

    Unofficial Python client for Twitter

    Language:Python433514
  • giveabit/Trio-Plus-Data

    Extract audio and other data from the Digitech Trio Plus guitar pedal's SD card

    Language:Python421529
  • m92vyas/llm-reader

    Turn a webpage into LLM-friendly input text. Similar to Jina Reader and Firecrawl API. Makes extracting image and webpage links easy for web scraping.

    Language:Python42203
  • Mamdouh66/Extracty

    Extract structured data from any unstructured web page

    Language:Python40302
  • Skyluker4/UnityAssetReplacer

    A tool to replace data in a Unity Asset Bundle from modified files.

    Language:C#39257
  • peterbencze/serritor

    Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render data.

    Language:Java3131915
  • ionictemplate-app/Social-Network-Data-Scraper-Pro

    Easily scrape 10,000+ email messages in one hour, helping you quickly increase your customers. Extracts data from LinkedIn, Facebook, Instagram, YouTube, Pinterest, and Twitter. Search by specific keywords. Ready-to-use Social Network Data Scraper software to get started instantly. Includes source code and install file.

  • hseera/python-utilities

    Various Python utility scripts to help automate mundane or repetitive tasks. Useful for performance testers, data scientists, or anyone who wants to automate mundane tasks in Python.

    Language:Python26201
  • serhaturtis/TOOL-FastBatchImageCrop

    A simple UI tool to batch crop images to prepare datasets from images and videos.

    Language:Python25322
  • peterstangl/svg2data

    A Python module for reading data from a plot provided as an SVG file.

    Language:Python22653
  • righthandabacus/mdict_reader

    Extract data from Octopus mdict (*.mdd, *.mdx) files

    Language:Python22208
  • ark-mod/ArkSavegameToolkitNet

    Library for reading ARK Survival Evolved savegame files using C#.

    Language:C#214827
  • mhismail/PinPoint-Digitizer

    Open source digitizer application to extract data from plots

    Language:SCSS20111
  • pdfix/pdfix_sdk_example_cpp

    Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

    Language:C++20514
  • Akulbasov/PGA

    This is a library for making batch requests to the Google Analytics Core Reporting v3 API and extracting data from a Google Analytics property into Python 3 data structures.

    Language:Python19700
  • alienzhou/giframe

    Extract the first frame of a GIF without reading the whole file; supports both browser and Node.js 📸

    Language:TypeScript17304