Incomplete-but-useful list of big-data related projects packed into a JSON dataset.
External references: Main page, Raw JSON data of projects, Original page on my blog
Related projects: Hadoop Ecosystem Table by Javi Roman, Awesome Big Data by Onur Akpolat, Awesome Awesomeness by Alexander Bayandin, Awesome Hadoop by Youngwoo Kim, Queues.io by Łukasz Strzałkowski
Add a new JSON file to the projects-data directory. Here is an example:
{
"name": "Apache Hadoop",
"description": "framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)",
"abstract": "framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)",
"category": "Frameworks",
"tags": ["framework", "yahoo", "apache"],
"links": [{"text": "Apache Hadoop", "url": "http://hadoop.apache.org/"}]
}
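Before submitting a new project file, it can help to check that it has the same shape as the example above. The following is a minimal sketch, not part of the repository: the function name `validate_project` is hypothetical, and treating every key from the example as required is an assumption.

```python
import json

# Keys taken from the Apache Hadoop example above; treating all of them
# as required is an assumption of this sketch.
REQUIRED_KEYS = ("name", "description", "abstract", "category", "tags", "links")


def validate_project(entry):
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if not isinstance(entry.get("tags", []), list):
        problems.append("tags must be a list")
    links = entry.get("links", [])
    if not isinstance(links, list):
        problems.append("links must be a list")
    else:
        # Each link in the example carries a display text and a URL.
        for i, link in enumerate(links):
            if not isinstance(link, dict) or "text" not in link or "url" not in link:
                problems.append(f"links[{i}] needs both 'text' and 'url'")
    return problems


example = json.loads("""
{
  "name": "Apache Hadoop",
  "description": "framework for distributed processing",
  "abstract": "framework for distributed processing",
  "category": "Frameworks",
  "tags": ["framework", "yahoo", "apache"],
  "links": [{"text": "Apache Hadoop", "url": "http://hadoop.apache.org/"}]
}
""")

print(validate_project(example))  # prints []
```

The check only looks at structure; it does not verify that the category is one of the listed categories or that the URL resolves.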
Add a new JSON file to the papers-data directory. Here is an example:
{
"title": "The Google File System",
"year": "2003",
"authors": "",
"abstract": "",
"tags": ["google"],
"links": [{"text": "PDF Paper", "url": "http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf"}]
}
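A paper entry with the same keys as the example above could be generated rather than typed by hand. This is a rough sketch under stated assumptions: the helper `make_paper_entry` and its parameter names are hypothetical, and the default link text simply mirrors the example.

```python
import json


def make_paper_entry(title, year, url, link_text="PDF Paper",
                     authors="", abstract="", tags=None):
    """Build a dict with the same keys, in the same order, as the example paper entry."""
    return {
        "title": title,
        "year": str(year),  # the example stores the year as a string
        "authors": authors,
        "abstract": abstract,
        "tags": list(tags or []),
        "links": [{"text": link_text, "url": url}],
    }


entry = make_paper_entry(
    "The Google File System",
    2003,
    "http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf",
    tags=["google"],
)

# Serialize with indentation so the file stays easy to review in a diff.
print(json.dumps(entry, indent=2))
```

The printed text can be saved as a new file in the papers-data directory; the file name is left to the contributor, since the source does not state a naming convention.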
- Frameworks
- Distributed Programming
- Distributed Filesystem
- Key-Map Data Model
- Document Data Model
- Key-value Data Model
- Graph Data Model
- NewSQL Databases
- Columnar Databases
- Time-Series Databases
- SQL-like processing
- Integrated Development Environments
- Data Ingestion
- Message-oriented middleware
- Service Programming
- Scheduling
- Machine Learning
- Benchmarking
- Security
- System Deployment
- Container Manager
- Applications
- Search engine and framework
- MySQL forks and evolutions
- PostgreSQL forks and evolutions
- Memcached forks and evolutions
- Embedded Databases
- Business Intelligence
- Data Analysis
- Data Warehouse
- Data Visualization
- Internet of Things
- Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
- AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
- Akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- Amazon Lambda - a compute service that runs your code in response to events and automatically manages the compute resources for you.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- AMPLab Succinct - Enabling Queries on Compressed Data.
- Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
- Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
- Apache Flink - high-performance runtime, and automatic program optimization.
- Apache Gora - framework for in-memory data model and persistence.
- Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache S4 - framework for stream processing, implementation of S4.
- Apache Spark - framework for in-memory cluster computing.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Storm - framework for stream processing by Twitter, also runs on YARN.
- Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- DistributedR - scalable high-performance platform for the R language.
- Drools - a Business Rules Management System (BRMS) solution.
- eBay Oink - REST based interface for PIG execution.
- Esper - a highly scalable, memory-efficient, in-memory computing, SQL-standard, minimal latency, real-time streaming-capable Big Data processing engine for historical data.
- Facebook Corona - Hadoop enhancement which removes single point of failure.
- Facebook Peregrine - Map Reduce framework.
- Facebook Scuba - distributed in-memory datastore.
- GearPump - a lightweight real-time big data streaming engine.
- Geotrellis - geographic data processing engine for high performance applications.
- GetStream Stream Framework - a Python library, which allows you to build newsfeed and notification systems using Cassandra and/or Redis.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework.
- Google Dataflow - create data pipelines to help them ingest, transform and analyze data.
- Google MapReduce - map reduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- GraphLab Dato - fast, scalable engine of GraphLab Create, a Python library.
- Hazelcast - In-Memory Data Grid.
- HParser - data parsing transformation environment optimized for Hadoop.
- IBM Streams - advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
- Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Kryo - Java serialization and cloning: fast, efficient, automatic.
- LinkedIn Cubert - a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.
- Lipstick - Pig workflow visualization tool.
- Metamarkets Druid - framework for real-time analysis of large datasets.
- Microsoft Azure Stream Analytics - an event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications and data.
- Microsoft Orleans - a straightforward approach to building distributed high-scale computing applications.
- Microsoft Trill - a high-performance in-memory incremental analytics engine.
- Netflix Aegisthus - Bulk Data Pipeline out of Cassandra. Implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
- Netflix Lipstick - Pig Visualization framework.
- Netflix Mantis - Event Stream Processing System.
- Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
- Netflix STAASH - language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems.
- Netflix Surus - a collection of tools for analysis in Pig and Hive.
- Netflix Zeno - Netflix's In-Memory Data Propagation Framework.
- Nextflow - Dataflow oriented toolkit for parallel and distributed computational pipelines.
- Nokia Disco - MapReduce framework developed by Nokia.
- Parsely Streamparse - lets you run Python code against real-time streams of data. It also integrates Python smoothly with Apache Storm.
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
- Pinterest Pinlater - asynchronous job execution system.
- Pubnub - Data stream network.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- ScaleOut hServer - fast, scalable in-memory data grid for Hadoop.
- SeqPig - simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop.
- SigmoidAnalytics Spork - Pig on Apache Spark.
- spark-dataflow - allows users to execute dataflow pipelines with Spark.
- SpatialHadoop - a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- Spring for Apache Hadoop - unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive.
- SQLStream Blaze - stream processing platform.
- Stratio Streaming - the union of a real-time messaging bus with a complex event processing engine using Spark Streaming.
- Stratosphere - general purpose cluster computing framework.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
- Sumo Logic - cloud based analyzer for machine-generated data.
- Teradata QueryGrid - data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop.
- TIBCO ActiveSpaces - in-memory data grid.
- Tigon - a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications.
- Torch - Scientific computing for LuaJIT.
- Trident - a high-level abstraction for doing realtime computing on top of Storm.
- Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
- Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
- Apache HDFS - a way to store large files across multiple machines.
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Ceph Filesystem - software storage platform.
- Disco DDFS - distributed filesystem.
- Facebook Haystack - object storage system.
- Google Colossus - distributed filesystem (GFS2).
- Google GFS - distributed filesystem.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- HDFS-DU - an interactive visualization of the Hadoop distributed file system.
- Lustre file system - high-performance distributed filesystem.
- Netflix S3mper - library that provides an additional layer of consistency checking on top of Amazon's S3 index through use of a consistent, secondary index.
- Quantcast File System QFS - open-source distributed file system.
- Red Hat GlusterFS - scale-out network-attached storage file system.
- Tachyon - reliable file sharing at memory speed across cluster frameworks.
- Actian Vector - column-oriented analytic database.
- Apache Accumulo - distributed key/value store, built on Hadoop.
- Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
- Apache HBase - column-oriented distributed datastore, inspired by BigTable.
- Facebook HydraBase - evolution of HBase made by Facebook.
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - fully managed, schemaless database for storing non-relational data over BigTable.
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
- InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
- MapR-DB - fast, scalable, and enterprise-ready in-Hadoop database architected to manage big data.
- Netflix Priam - Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
- OhmData C5 - improved version of HBase.
- Sqrrl - NoSQL databases on top of Apache Accumulo.
- Tephra - Transactions for HBase.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
- Actian Versant - commercial object-oriented database management system.
- Amazon SimpleDB - a highly available and flexible non-relational data store that offloads the work of database administration.
- Clusterpoint - a database software for high-speed storage and large-scale processing of XML and JSON data on clusters of commodity hardware.
- Crate Data - is an open source massively scalable data store. It requires zero administration.
- Facebook Apollo - Facebook’s Paxos-like NoSQL database.
- jumboDB - document oriented datastore over Hadoop.
- LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
- Microsoft DocumentDB - fully-managed, highly-scalable, NoSQL document database service.
- MongoDB - Document-oriented database system.
- RavenDB - A transactional, open-source Document Database.
- RethinkDB - document database that supports queries like table joins and group by.
- Terrastore - a modern document store which provides advanced scalability and elasticity features without sacrificing consistency.
- TokuMX - High-Performance MongoDB Distribution.
- Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies".
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
- Couchbase ForestDB - Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie.
- Edis - is a protocol-compatible Server replacement for Redis.
- ElephantDB - Distributed database specialized in exporting data from Hadoop.
- EventStore - distributed time series database.
- Exasolution - an in-memory, column-oriented, relational database management system.
- HyperDex - next generation key-value store.
- KAI - a distributed key-value datastore.
- LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
- LinkedIn Voldemort - distributed key/value storage system.
- MemcacheDB - a distributed key-value storage system designed for persistence.
- Netflix Dynomite - thin Dynamo-based replication for cached data.
- Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
- RAMCloud - storage system that provides large-scale low-latency storage by keeping all data in DRAM all the time and aggregating the main memories of thousands of servers.
- Redis - in memory key value datastore.
- Redis Cluster - distributed implementation of Redis.
- Redis Sentinel - system designed to help managing Redis instances.
- Riak - a decentralized datastore.
- Scalaris - a distributed transactional key-value store.
- Storehaus - library to work with asynchronous key value stores, by Twitter.
- Tarantool - an efficient NoSQL database and a Lua application server.
- TreodeDB - key-value store that's replicated and sharded and provides atomic multirow writes.
- Yahoo Sherpa - hosted, distributed and geographically replicated key-value cloud storage platform.
- Apache Giraph - implementation of Pregel, based on Hadoop.
- Apache Spark Bagel - implementation of Pregel, part of Spark.
- ArangoDB - multi-model distributed database.
- Facebook TAO - the distributed data store that is widely used at Facebook to store and serve the social graph.
- Faunus - Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
- Google Cayley - open-source graph database.
- Google Pregel - graph processing framework.
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
- GraphX - resilient Distributed Graph System on Spark.
- Gremlin - graph traversal Language.
- HyperGraphDB - general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs.
- InfiniteGraph - distributed graph database.
- Infovore - RDF-centric Map/Reduce framework.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
- MapGraph - Massively Parallel Graph processing on GPUs.
- Neo4j - graph database written entirely in Java.
- OrientDB - document and graph database.
- Phoebus - framework for large scale graph processing.
- Pinterest Zen - Pinterest's Graph Storage Service.
- Sparksee - scalable high-performance graph database.
- Stardog - graph database: search, query, reasoning, and constraints in a lightweight, pure Java system.
- Titan - distributed graph database, built over Cassandra.
- Twitter FlockDB - distributed graph database.
- Actian Ingres - commercially supported, open-source SQL relational database management system.
- BayesDB - statistic oriented SQL database.
- Cockroach - Scalable, Geo-Replicated, Transactional Datastore.
- Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
- FoundationDB - distributed database, inspired by F1.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
- HandlerSocket - NoSQL plugin for MySQL/MariaDB.
- IBM DB2 - object-relational database management system.
- InfiniSQL - infinitely scalable RDBMS.
- MemSQL - in-memory SQL database with optimized columnar storage on flash.
- NuoDB - SQL/ACID compliant distributed database.
- Oracle Database - object-relational database management system.
- Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
- SAP HANA - is an in-memory, column-oriented, relational database management system.
- Segment SQL - Track your customer data to Amazon Redshift.
- SenseiDB - distributed, realtime, semi-structured database.
- Sky - database used for flexible, high performance analysis of behavioral data.
- SymmetricDS - open source software for both file and database synchronization.
- Teradata Database - complete relational database management system.
- VoltDB - in-memory NewSQL database.
- Amazon RedShift - data warehouse service, based on PostgreSQL.
- C-Store - column oriented DBMS.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Google Dremel - framework for interactive analysis, implementation of Dremel.
- MonetDB - column store database.
- Parquet - columnar storage format for Hadoop.
- Pivotal Greenplum - purpose-built, dedicated analytic data warehouse.
- Vertica - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
- Cube - uses MongoDB to store time series data.
- Etsy StatsD - simple daemon for easy stats aggregation.
- InfluxDB - distributed time series database.
- KairosDB - similar to OpenTSDB but allows for Cassandra.
- OpenTSDB - distributed time series database on top of HBase.
- Prometheus - an open-source service monitoring system and time series database.
- Square Cube - system for collecting timestamped events and deriving metrics.
- TempoIQ - Cloud-based sensor analytics.
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- AMPLAB Shark - data warehouse system for Spark.
- Apache Drill - framework for interactive analysis, inspired by Dremel.
- Apache HCatalog - table and storage management layer for Hadoop.
- Apache Hive - SQL-like data warehouse system for Hadoop.
- Apache Optiq - framework that allows efficient translation of queries involving heterogeneous and federated data.
- Apache Phoenix - SQL skin over HBase.
- BlinkDB - massively parallel, approximate query engine.
- Cloudera Impala - framework for interactive analysis, inspired by Dremel.
- Concurrent Lingual - SQL-like query language for Cascading.
- Datasalt Splout SQL - full SQL query engine for big datasets.
- eBay Kylin - Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.
- Facebook PrestoDB - distributed SQL query engine.
- Hadapt - a native implementation of SQL for the Apache Hadoop open-source project.
- JethroData - index-based SQL engine for Hadoop.
- Metanautix Quest - data compute engine.
- Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
- RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
- Spark Catalyst - is a Query Optimization Framework for Spark and Shark.
- SparkSQL - Manipulating Structured Data Using Spark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Stinger - interactive query for Hive.
- Tajo - distributed data warehouse system on Hadoop.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
- R-Studio - IDE for R.
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Apache BookKeeper - a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig.
- Apache Chukwa - data collection system.
- Apache Flume - service to manage large amount of log data.
- Apache Samza - stream processing framework, based on Kafka and YARN.
- Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
- Apache UIMA - Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user.
- Cloudera Morphlines - framework that helps ETL into Solr, HBase and HDFS.
- Facebook Scribe - streamed log data aggregator.
- Fluentd - tool to collect events and logs.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- Heka - open source stream processing software system.
- HIHO - framework for connecting disparate data sources with Hadoop.
- LinkedIn Camus - Kafka to HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka.
- LinkedIn Databus - stream of change capture events for a database.
- LinkedIn Gobblin - a framework for Solving Big Data Ingestion Problem.
- LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
- LinkedIn Lumos - bridge from OLTP to OLAP for use on Hadoop.
- LinkedIn White Elephant - log aggregator and dashboard.
- Logstash - a tool for managing events and logs.
- Netflix Suro - data pipeline service for collecting, aggregating, and dispatching large volume of application events including log data based on Chukwa.
- Pinterest Secor - a service implementing Kafka log persistence.
- Record Breaker - Automatic structure for your text-formatted data.
- TIBCO Enterprise Message Service - standards-based messaging middleware.
- Twitter Zipkin - distributed tracing system that helps us gather timing data for all the disparate services at Twitter.
- Vibe Data Stream - streaming data collection for real-time Big Data analytics.
- ActiveMQ - open source messaging and Integration Patterns server.
- Amazon Simple Queue Service - fast, reliable, scalable, fully managed queue service.
- Apache Kafka - distributed publish-subscribe messaging system.
- Apache Qpid - messaging tools that speak AMQP and support many languages and platforms.
- Apollo - ActiveMQ's next generation of messaging.
- Beanstalkd - simple, fast work queue.
- Bit.ly NSQ - realtime distributed message processing at scale.
- Celery - Distributed Task Queue.
- Crossroads I/O - library for building scalable and high performance distributed applications.
- Darner - simple, lightweight message queue.
- Facebook Iris - a totally ordered queue of messaging updates with separate pointers into the queue indicating the last update sent to your Messenger app and the traditional storage tier.
- Gearman - Job Server.
- Google Cloud Pub/Sub - reliable, many-to-many, asynchronous messaging hosted on Google's infrastructure.
- HornetQ - open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system.
- IronMQ - easy-to-use highly available message queuing service.
- Kestrel - distributed message queue system.
- Marconi - queuing and notification service made by and for OpenStack, but not only for it.
- RabbitMQ - Robust messaging for applications.
- RestMQ - message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources.
- RQ - simple Python library for queueing jobs and processing them in the background with workers.
- Sidekiq - Simple, efficient background processing for Ruby.
- ZeroMQ - The Intelligent Transport Layer.
- Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.
- Apache Avro - data serialization system.
- Apache Curator - Java libraries for Apache ZooKeeper.
- Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
- Apache Thrift - framework to build binary protocols.
- Apache Zookeeper - centralized service for process management.
- Google Chubby - a lock service for loosely-coupled distributed systems.
- LinkedIn Norbert - cluster manager.
- MPICH - high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
- OpenMPI - message passing framework.
- Serf - decentralized solution for service discovery and orchestration.
- Spotify Luigi - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
- Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Twitter Elephant Bird - libraries for working with LZOP-compressed data.
- Twitter Finagle - asynchronous network stack for the JVM.
- Apache Aurora - is a service scheduler that runs on top of Apache Mesos.
- Apache Falcon - data management framework.
- Apache Oozie - workflow job scheduler.
- Chronos - distributed and fault-tolerant scheduler.
- LinkedIn Azkaban - batch workflow job scheduler.
- Pinterest Pinball - customizable platform for creating workflow managers.
- Sparrow - scheduling platform.
- Apache Mahout - machine learning library for Hadoop.
- Ayasdi Core - tool for topological data analysis.
- brain - Neural networks in JavaScript.
- Cloudera Oryx - real-time large-scale machine learning.
- Concurrent Pattern - machine learning library for Cascading.
- convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
- cuDNN - GPU-accelerated library of primitives for deep neural networks.
- Decider - Flexible and Extensible Machine Learning in Ruby.
- etcML - text classification with machine learning.
- Etsy Conjecture - scalable Machine Learning in Scalding.
- fbcunn - Deep Learning CUDA Extensions from Facebook AI Research.
- Google Sibyl - System for Large Scale Machine Learning at Google.
- H2O - statistical, machine learning and math runtime for Hadoop.
- IBM Watson - cognitive computing system.
- LinkedIn ml-ease - ADMM based large scale logistic regression.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
- scikit-learn - machine learning in Python.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Sparkling Water - combine H2O's Machine Learning capabilities with the power of the Spark platform.
- Theano - Python package for deep learning that can utilize NVIDIA's CUDA toolkit to run on the GPU.
- Thunder - Large-scale analysis of neural data.
- Vahara - Machine learning and natural language processing with Apache Pig.
- Viv - global platform that enables developers to plug into and create an intelligent, conversational interface to anything.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- WEKA - suite of machine learning software.
- Wit - Natural Language for the Internet of Things.
- Wolfram Alpha - computational knowledge engine.
- YHat ScienceOps - platform for deploying, managing, and scaling predictive models in production applications.
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performances.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- Big-Bench - Big Bench Workload Development.
- Hive-benchmarks - some benchmarking queries for Apache Hive.
- Hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
- Intel HiBench - a Hadoop benchmark suite.
- Mesosaurus - Mesos task load simulator framework for (cluster and Mesos) performance analysis.
- Netflix Inviso - performance focused Big Data tool.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from Yahoo engineer team.
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
- Apache Ranger - framework to enable, monitor and manage comprehensive data security across the Hadoop platform (formerly called Apache Argus).
- Apache Sentry - security module for data stored in Hadoop.
- PacketPig - Open Source Big Data Security Analytics.
- Voltage SecureData - data protection framework.
- Ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- Apache Ambari - operational framework for Hadoop management.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Slider - is a YARN application to deploy existing distributed applications on YARN.
- Apache Whirr - set of libraries for running cloud services.
- Apache YARN - Cluster manager.
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - Similar to Apache BigTop based on Groovy language.
- Cloudera Director - a comprehensive data management platform with the flexibility and power to evolve with your business.
- Cloudera HUE - web application for interacting with Hadoop.
- Deimos - Mesos containerizer hooks for Docker.
- Develoop - tool for provisioning, managing and monitoring Apache Hadoop.
- Etsy Sahale - Visualizing Cascading Workflows at Etsy.
- Facebook Autoscale - the load balancer will concentrate workload to a server until it has at least a medium-level workload.
- Facebook Prism - multi datacenters replication system.
- Ganglia Monitoring System - scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
- Genie - provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Hannibal - a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- Hortonworks HOYA - application that can deploy HBase cluster on YARN.
- Jumbune - an open-source product built for analyzing Hadoop clusters and MapReduce jobs.
- Marathon - Mesos framework for long-running services.
- Minotaur - scripts/recipes/configs to spin up VPC-based infrastructure in AWS from scratch and deploy labs to it.
- Myriad - a Mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events as per configured rules and policies.
- Netflix SimianArmy - a suite of tools for keeping your cloud operating in top form.
- Tumblr Collins - Infrastructure management for engineers.
- Tumblr Genesis - a tool for data center automation.
- Amazon EC2 Container Service - a highly scalable, high performance container management service that supports Docker containers.
- Docker - an open platform for developers and sysadmins to build, ship, and run distributed applications.
- Fig - fast, isolated development environments using Docker.
- Google Container Engine - Run Docker containers on Google Cloud Platform, powered by Kubernetes.
- Kubernetes - open source implementation of container cluster management.
- Rocket - an alternative to the Docker runtime, designed for server environments with the most rigorous security and production requirements.
- Adobe Spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.
- Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
- Apache Nutch - open source web crawler.
- Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
- Apache Tika - content analysis toolkit.
- Domino - Run, scale, share, and deploy models without any infrastructure.
- Eclipse BIRT - Eclipse-based reporting system.
- Eventhub - open source event analytics platform.
- HIPI Library - API for performing image processing tasks on Hadoop's MapReduce.
- Hunk - Splunk analytics for Hadoop.
- MADlib - data-processing library of an RDBMS to analyze data.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- Sense - Cloud Platform for Data Science and Big Data Analytics.
- Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
- Splunk - analyzer for machine-generated data.
- Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
- Apache Blur - a search engine capable of querying massive amounts of structured data at incredible speeds.
- Apache Lucene - Search engine library.
- Apache Solr - Search platform for Apache Lucene.
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
- Facebook Unicorn - social graph search platform.
- Google Caffeine - continuous indexing system.
- Google Percolator - continuous indexing system.
- TeraGoogle - large search index.
- Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- HBase Coprocessor - implementation of Percolator, part of HBase.
- hIndex - Secondary Index for HBase.
- SF1R Search Engine - distributed search engine written in C++.
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
- LinkedIn Bobo - a faceted search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Cleo - a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- LinkedIn Galene - search architecture at LinkedIn.
- LinkedIn Zoie - a realtime search/indexing system written in Java.
- Sphinx Search Server - fulltext search engine.
- Amazon Aurora - a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
- Amazon RDS - MySQL databases in Amazon's cloud.
- Drizzle - evolution of MySQL 6.0.
- Google Cloud SQL - MySQL databases in Google's cloud.
- HiveDB - an open source framework for horizontally partitioning MySQL systems.
- MariaDB - enhanced, drop-in replacement for MySQL.
- MariaDB Galera - a synchronous multi-master cluster for MariaDB.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine providing shared-nothing clustering and auto-sharding.
- Percona Server - enhanced, drop-in replacement for MySQL.
- ProxySQL - High Performance Proxy for MySQL.
- TokuDB - a storage engine for MySQL and MariaDB.
- WebScaleSQL - a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
- Youtube Vitess - provides servers and tools which facilitate scaling of MySQL databases for large scale web services.
- HadoopDB - hybrid of MapReduce and DBMS.
- IBM Netezza - high-performance data warehouse appliances.
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
- RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
- Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
- Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
- Box Tron - proxy to memcached servers.
- Facebook McDipper - key/value cache for flash storage.
- Facebook Mcrouter - a memcached protocol router for scaling memcached deployments.
- Facebook Memcached - fork of Memcached.
- Twemproxy - A fast, light-weight proxy for memcached and redis.
- Twitter Fatcache - key/value cache for flash storage.
- Twitter Twemcache - fork of Memcached.
- Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
- BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
- eXtreme DB - in-memory database combines exceptional performance, reliability and developer efficiency in a proven real-time embedded database engine.
- FairCom c-treeACE - a cross-platform database engine.
- Google Firebase - a powerful API to store and sync data in realtime.
- HamsterDB - transactional key-value database.
- HanoiDB - Erlang LSM BTree Storage.
- LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
- LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
- RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.
- Tokyo Cabinet - a library of routines for managing a database.
- ActivePivot - Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing.
- Adatao - business intelligence and data science platform.
- Apama analytics - platform for streaming analytics and intelligent automated action.
- Atigeo xPatterns - data analytics platform.
- BIME Analytics - business intelligence platform in the cloud.
- Chartio - lean business intelligence platform to visualize and explore your data.
- Datapine - self-service business intelligence tool in the cloud.
- Jaspersoft - powerful business intelligence suite.
- Jedox Palo - customisable Business Intelligence platform.
- Lavastorm Analytics - used for audit analytics, revenue assurance, fraud management, and customer experience management.
- LinkedIn GoSpeed - provides RUM data processing, visualization, monitoring, and analyses data daily, hourly, or on a near real-time basis.
- Map-D - GPU in-memory database, big data analysis and visualization platform.
- Microsoft - business intelligence software and platform.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Pentaho - business intelligence platform.
- Qlik - business intelligence and analytics platform.
- SpagoBI - open source business intelligence platform.
- Spotfire - business intelligence platform.
- Tableau - business intelligence platform.
- Teradata Aster - Big Data Analytics.
- Tessera - Environment for Deep Analysis of Large Complex Data.
- Zeppelin - open source data analysis environment on top of Hadoop.
- Zoomdata - Big Data Analytics.
- LinkedIn Pinot - a distributed system that supports columnar indexes with the ability to add new types of indexes.
- Myria - scalable Analytics-as-a-Service platform based on relational algebra.
- Pinalytics - Pinterest's data analytics engine.
- Zillabyte - an API for distributed data computation. Scale with your data.
- Google Mesa - highly scalable analytic data warehousing system.
- IBM BigInsights - data processing, warehousing and analytics.
- IBM dashDB - Data Warehousing and Analysis Needs, all in the Cloud.
- Microsoft Cosmos - Microsoft's internal BigData analysis platform.
- Arbor - graph visualization library using web workers and jQuery.
- C3 - D3-based reusable chart library.
- CartoDB - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
- Chart.js - open source HTML5 Charts visualizations.
- Chartist.js - another open source HTML5 Charts visualization.
- Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
- Cubism - JavaScript library for time series visualization.
- Cytoscape - JavaScript library for visualizing complex networks.
- D3 - JavaScript library for manipulating documents.
- DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
- Envisionjs - dynamic HTML5 visualization.
- FnordMetric ChartSQL - allows you to write SQL queries that return charts instead of tables. The charts are rendered as SVG vector graphics.
- Freeboard - open source real-time dashboard builder for IOT and other web mashups.
- Gephi - An award-winning open-source platform for visualizing and manipulating large graphs and network connections.
- Google Charts - simple charting API.
- Grafana - open source, feature rich metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB.
- Graphite - scalable Realtime Graphing.
- Highcharts - simple and flexible charting API.
- IPython - provides a rich architecture for interactive computing.
- Keylines - toolkit for visualizing the networks in your data.
- Kibana - visualize logs and time-stamped data.
- Matplotlib - plotting with Python.
- NVD3 - chart components for d3.js.
- Peity - Progressive SVG bar, line and pie charts.
- Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
- Recline - simple but powerful library for building data applications in pure Javascript and HTML.
- Redash - open-source platform to query and visualize data.
- Sigma.js - JavaScript library dedicated to graph drawing.
- Square Cubism.js - a D3 plugin for visualizing time series. Use Cubism to construct better realtime dashboards, pulling data from Graphite, Cube and other sources.
- Vega - a visualization grammar.
- 2lemetry - Platform for Internet of things.
- Evrything - Making products smart.
- ThingWorx - Rapid development and connection of intelligent systems.
- Published in 2015
- Published in 2014
- Published in 2013
- Published in 2012
- Published in 2011
- Published in 2010
- Published in 2009
- Published in 2008
- Published in 2007
- Published in 2006
- Published in 2005
- Published in 2004
- Published in 2003
- Published in 2002
- Published in 2001
- Published in 2000
- Published in 1999
- Published in 1998
- Published in 1997
- 2015 - Deep Image: Scaling up Image Recognition
- 2015 - Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
- 2015 - Machine Learning Classification over Encrypted Data
- 2015 - Machine Learning Methods for Computer Security
- 2015 - Self-Repairing Disk Arrays
- 2015 - Trill: A High-Performance Incremental Query Processor for Diverse Analytics
- 2014 - 3D Object Manipulation in a Single Photograph using Stock 3D Models
- 2014 - A Partitioning Framework for Aggressive Data Skipping
- 2014 - A Self-Configurable Geo-Replicated Cloud Storage System
- 2014 - All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
- 2014 - Arrakis: The Operating System is the Control Plane
- 2014 - Automatic Construction of Inference-Supporting Knowledge Bases
- 2014 - Bayesian group latent factor analysis with structured sparse priors
- 2014 - Chinese Open Relation Extraction for Knowledge Acquisition
- 2014 - Coordination Avoidance in Database Systems
- 2014 - DeepFace: Closing the Gap to Human-Level Performance in Face Verification
- 2014 - Diagram Understanding in Geometry Questions
- 2014 - Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
- 2014 - Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
- 2014 - Eidetic Systems
- 2014 - Execution Primitives for Scalable Joins and Aggregations in Map Reduce
- 2014 - Extracting More Concurrency from Distributed Transactions
- 2014 - f4: Facebook's Warm BLOB Storage System
- 2014 - Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
- 2014 - Fastpass: A Centralized "Zero-Queue" Datacenter Network
- 2014 - First-person Hyper-lapse Videos
- 2014 - GloVe: Global Vectors for Word Representation
- 2014 - GraphX: Graph Processing in a Distributed Dataflow Framework
- 2014 - Guess Who Rated This Movie: Identifying Users Through Subspace Clustering
- 2014 - In Search of an Understandable Consensus Algorithm
- 2014 - Learning Everything about Anything: Webly-Supervised Visual Concept Learning
- 2014 - Learning to Solve Arithmetic Word Problems with Verb Categorization
- 2014 - Log-structured Memory for DRAM-based Storage
- 2014 - Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
- 2014 - MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
- 2014 - Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
- 2014 - Modeling Biological Processes for Reading Comprehension
- 2014 - Orca: A Modular Query Optimizer Architecture for Big Data
- 2014 - Pigeon: A Spatial MapReduce Language
- 2014 - Project Adam: Building an Efficient and Scalable Deep Learning Training System
- 2014 - Quantum Deep Learning
- 2014 - R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
- 2014 - Salt: Combining ACID and BASE in a Distributed Database
- 2014 - Scalable Object Detection using Deep Neural Networks
- 2014 - Sequence to Sequence Learning with Neural Networks
- 2014 - Show and Tell: A Neural Image Caption Generator
- 2014 - Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
- 2014 - The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
- 2014 - The Trill Incremental Analytics Engine
- 2013 - A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
- 2013 - A Lightweight and High Performance Monolingual Word Aligner
- 2013 - Answer Extraction as Sequence Tagging with Tree Edit Distance
- 2013 - Automatic Coupling of Answer Extraction and Information Retrieval
- 2013 - CG_Hadoop: Computational Geometry in MapReduce
- 2013 - Consistency-Based Service Level Agreements for Cloud Storage
- 2013 - Dimension Independent Matrix Square using MapReduce
- 2013 - Druid: A Real-time Analytical Data Store
- 2013 - Efficient Estimation of Word Representations in Vector Space
- 2013 - Event labeling combining ensemble detectors and background knowledge
- 2013 - Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
- 2013 - F1: A Distributed SQL Database That Scales
- 2013 - Fast Training of Convolutional Networks through FFTs
- 2013 - GraphX: A Resilient Distributed Graph System on Spark
- 2013 - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
- 2013 - MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- 2013 - MLbase: A Distributed Machine-learning System
- 2013 - Naiad: A Timely Dataflow System
- 2013 - Online, Asynchronous Schema Change in F1
- 2013 - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
- 2013 - Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- 2013 - Rich feature hierarchies for accurate object detection and semantic segmentation
- 2013 - Scalable Progressive Analytics on Big Data in the Cloud
- 2013 - Scaling Memcache at Facebook
- 2013 - Scuba: Diving into Data at Facebook
- 2013 - Semi-Markov Phrase-based Monolingual Alignment
- 2013 - Shark: SQL and Rich Analytics at Scale
- 2013 - Some Improvements on Deep Convolutional Neural Network Based Image Classification
- 2013 - TAO: Facebook's Distributed Data Store for the Social Graph
- 2013 - Toward Common Patterns for Distributed, Concurrent, Fault-Tolerant Code
- 2013 - Unicorn: A System for Searching the Social Graph
- 2013 - Warp: Lightweight Multi-Key Transactions for Key-Value Stores
- 2012 - A Few Useful Things to Know about Machine Learning
- 2012 - A Sublinear Time Algorithm for PageRank Computations
- 2012 - Avatara: OLAP for Web-scale Analytics Products
- 2012 - Blink and It's Done. Interactive Queries on Very Large Data
- 2012 - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- 2012 - Building high-level features using large scale unsupervised learning
- 2012 - Dimension Independent Similarity Computation
- 2012 - Earlybird: Real-Time Search at Twitter
- 2012 - Fast and Interactive Analytics over Hadoop Data with Spark
- 2012 - HyperDex: A Distributed, Searchable Key-Value Store
- 2012 - ImageNet Classification with Deep Convolutional Neural Networks
- 2012 - Large-Scale Machine Learning at Twitter
- 2012 - Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
- 2012 - Paxos Made Parallel
- 2012 - Paxos Replicated State Machines as the Basis of a High-Performance Data Store
- 2012 - Perspectives on the CAP Theorem
- 2012 - Processing a Trillion Cells per Mouse Click
- 2012 - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
- 2012 - Spanner: Google's Globally-Distributed Database
- 2012 - Temporal Analytics on Big Data for Web Advertising
- 2012 - The Unified Logging Infrastructure for Data Analytics at Twitter
- 2012 - The Vertica Analytic Database: C-Store 7 Years Later
- 2011 - Consistency, Availability, and Convergence
- 2011 - CrowdDB: Answering Queries with Crowdsourcing
- 2011 - CrowdDB: Query Processing with the VLDB Crowd
- 2011 - Fast Crash Recovery in RAMCloud
- 2011 - Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- 2011 - It's Time for Low Latency
- 2011 - Matching Unstructured Product Offers to Structured Product Specifications
- 2011 - Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- 2011 - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- 2011 - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters
- 2010 - Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- 2010 - Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
- 2010 - Dremel: Interactive Analysis of Web-Scale Datasets
- 2010 - Finding a needle in Haystack: Facebook's photo storage
- 2010 - FlumeJava: Easy, Efficient Data-Parallel Pipelines
- 2010 - Large-scale Incremental Processing Using Distributed Transactions and Notifications
- 2010 - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- 2010 - Pregel: A System for Large-Scale Graph Processing
- 2010 - S4: Distributed Stream Computing Platform
- 2010 - Spark: Cluster Computing with Working Sets
- 2010 - The Learning Behind Gmail Priority Inbox
- 2010 - ZooKeeper: Wait-free coordination for Internet-scale systems
- 2009 - Cassandra - A Decentralized Structured Storage System
- 2009 - Feature Hashing for Large Scale Multitask Learning
- 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- 2009 - Vertical Paxos and Primary-Backup Replication
- 2008 - Chukwa: A large-scale monitoring system
- 2008 - Column-Stores vs. Row-Stores: How Different Are They Really?
- 2008 - PNUTS: Yahoo!'s Hosted Data Serving Platform
- 2008 - Top 10 algorithms in data mining
- 2007 - Architecture of a Database System
- 2007 - Consistent Streaming Through Time: A Vision for Event Stream Processing
- 2007 - Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
- 2007 - Dynamo: Amazon's Highly Available Key-value Store
- 2007 - Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments
- 2007 - Life beyond Distributed Transactions: an Apostate's Opinion
- 2007 - Paxos Made Live - An Engineering Perspective
- 2006 - Bigtable: A Distributed Storage System for Structured Data
- 2006 - Ceph: A Scalable, High-Performance Distributed File System
- 2006 - Map-Reduce for Machine Learning on Multicore
- 2006 - The Chubby lock service for loosely-coupled distributed systems
- 2005 - Fast Paxos
- 2003 - The Google File System
- 2002 - Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
- 2001 - Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
- 2001 - Paxos Made Simple
- 2001 - Random Forests