data-processing
There are 1258 repositories under the data-processing topic.
onceupon/Bash-Oneliner
A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
johnkerl/miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
TomWright/dasel
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
NVIDIA/DALI
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
pathwaycom/pathway
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
unionai-oss/pandera
A lightweight, flexible, and expressive statistical data testing library
dashbitco/broadway
Concurrent and multi-stage data ingestion and data processing with Elixir
asyml/texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
microsoft/DialoGPT
Large-scale pretraining for dialogue
python-bonobo/bonobo
Extract Transform Load for Python 3.5+
bytewax/bytewax
Python Stream Processing
GoogleCloudPlatform/data-science-on-gcp
Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
numaproj/numaflow
Kubernetes-native platform to run massively parallel data/streaming jobs
allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
jofpin/synthBTC
A tool that uses advanced Monte Carlo simulations and Turbit parallel processing to create possible Bitcoin prediction scenarios.
microsoft/GODEL
Large-scale pretrained models for goal-directed dialog
GoogleCloudPlatform/DataflowJavaSDK
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
asyml/texar-pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
hstreamdb/hstream
HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications.
benibela/xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
SebKrantz/collapse
Advanced and Fast Data Transformation in R
ChenghaoMou/text-dedup
All-in-one text de-duplication
NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
infoslack/awesome-kafka
A list about Apache Kafka
kousun12/eternal
👾~ music, eternal ~ 👾
maykulkarni/Machine-Learning-Notebooks
Machine Learning notebooks for refreshing concepts.
constellation-rs/amadeus
Harmonious distributed data analysis in Rust.
polyaxon/haupt
Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
flow-php/etl
PHP - ETL (Extract Transform Load) data processing library
ml6team/fondant
Production-ready data processing made easy and shareable
lithops-cloud/lithops
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
Puchaczov/Musoq
SQL Swiss Army Knife - Engine for Diverse Data Sources
matousc89/padasip
Python Adaptive Signal Processing
alttch/rapidtables
Super fast list of dicts to pre-formatted tables conversion library for Python 2/3
streamnative/pulsar-flink
Elastic data processing with Apache Pulsar and Apache Flink