Databases and Data Warehouses | ||||
---|---|---|---|---|
GitHub Repo | Official page | Questions | Description | Useful links |
Apache Cassandra | Cassandra is a distributed, wide-column NoSQL database management system. | Awesome Cassandra | ||
Greenplum | Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. | Awesome Greenplum | ||
MongoDB | MongoDB is a document-oriented database. | Awesome MongoDB | ||
Apache HBase | HBase is an open-source non-relational distributed database. | Awesome HBase | ||
Apache Hive | Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. | Awesome Hive | ||
Amazon DynamoDB | Amazon DynamoDB is a fully managed proprietary NoSQL database service. | Awesome DynamoDB, Awesome AWS | ||
Amazon Redshift | Amazon Redshift is a data warehouse product. | Amazon Redshift Utilities, Awesome AWS | ||
BigQuery GCP | BigQuery is a fully managed, serverless data warehouse. | Awesome BigQuery | ||
Bigtable GCP | Bigtable is a fully managed wide-column and key-value NoSQL database service. | Awesome Bigtable | ||
Data Formats | ||||
Apache Avro | Avro is a row-oriented remote procedure call and data serialization framework. | Awesome Avro | ||
Apache Parquet | Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. | TODO | ||
Delta | Delta Lake is a storage framework that enables building a Lakehouse architecture with compute engines | Delta examples | ||
Big Data Frameworks | ||||
Apache Airflow | Apache Airflow is a workflow management platform for data engineering pipelines. | Awesome Airflow | ||
Apache Flume | Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. | TODO | ||
Apache Hadoop | Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. | Awesome Hadoop | ||
Apache Impala | Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. | TODO | ||
Apache Kafka | Apache Kafka is a distributed event store and stream-processing platform. | Awesome Kafka | ||
Apache NiFi | Apache NiFi is a software project designed to automate the flow of data between software systems. | Awesome NiFi | ||
Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing. | Awesome Spark | ||
Apache Flink | Apache Flink is a unified stream-processing and batch-processing framework. | Awesome Flink | ||
Kubernetes | Kubernetes is a system for managing containerized applications across multiple hosts. | Awesome Kubernetes | ||
Cloud providers | ||||
Amazon Web Services | Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions. | Awesome AWS | ||
Microsoft Azure | Microsoft Azure is Microsoft's public cloud computing platform. | Awesome Azure | ||
Google Cloud Platform | Google Cloud Platform is a suite of cloud computing services. | Awesome GCP | ||
Theory | ||||
DWH Architectures | A data warehouse architecture defines how data is collected, processed, and presented to end users across the enterprise. | Awesome databases | ||
Data Structures | A data structure is a specialized format for organizing, processing, retrieving and storing data. | TODO | ||
SQL | SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). | Awesome SQL | ||
Data visualization tools | ||||
Tableau | Tableau is a data visualization tool used in business intelligence. | TODO |