Pinned Repositories
incubator-samza
Mirror of Apache Samza
kafka-embedded
Runs embedded, in-memory Apache Kafka instances. Helpful for integration testing.
kafka-manager
A tool for managing Apache Kafka.
kafka-spark-consumer
kafka-storm-starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data serialization format.
kangaroo
Hadoop utilities for Kafka
klio
Smarter data pipelines for audio.
mpire
A Python package for easy multiprocessing, but faster than multiprocessing
Neuraxle
Build neat pipelines with the right abstractions to do AutoML. Let your pipeline steps have hyperparameter spaces. Enable checkpoints to cut duplicate calculations. Go from research to production environment easily.
rabit
Reliable Allreduce and Broadcast Interface for distributed machine learning
data-processing's Repositories
data-processing/dpark
Python clone of Spark, a MapReduce-like framework in Python
data-processing/kafka-spark-consumer
data-processing/spindle
Next-generation web analytics processing with Scala, Spark, and Parquet.
data-processing/streamparse
streamparse lets you run Python code against real-time streams of data. Integrates with Apache Storm.
data-processing/fluid
data-processing/cassovary
Cassovary is a simple big graph processing library for the JVM
data-processing/snowplow
Enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres
data-processing/druid
Real²time Exploratory Analytics on Large Datasets
data-processing/grill
data-processing/spark-ec2
Scripts used to set up a Spark cluster on EC2
data-processing/kafka-storm-starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data serialization format.
data-processing/incubator-samza
Mirror of Apache Samza
data-processing/cdk
Cloudera Development Kit
data-processing/crunch
Crunch is an Apache TLP now, and lives at http://crunch.apache.org/
data-processing/Impatient
Source examples to support the "Cascading for the Impatient" blog post series
data-processing/exelixi
Exelixi is a distributed framework based on Apache Mesos, mostly implemented in Python using gevent for high-performance concurrency. It is intended to run cluster computing jobs (partitioned batch jobs, which include some messaging) in pure Python. By default, it runs genetic algorithms at scale.
data-processing/storm-yarn
Storm for YARN