Awesome Data Engineering Repository from the Philippines
Join our growing community!
Data Engineering Pilipinas | Facebook Group FB Page
Data Engineering Pilipinas | Discord Group
Data Engineering Pilipinas | Datacamp Studygroup Discord
Data Engineering Pilipinas | Meetup
Data Engineering Pilipinas | Youtube
FB Chat Topic & Link | Description |
---|---|
Data Trainings & Learning | Boost your data skills with training and bootcamps. |
Data Infra & Platforms | Explore data infrastructure and platforms for efficient data management. |
Data Governance & Quality | Ensure data accuracy and security with governance practices. |
Data Modeling & Design | Design databases and systems with effective data modeling. |
Data Integration | Extract, transform, and load data for analysis and insights. |
Data Engineering Pilipinas is a community for data engineers, data analysts, data scientists, developers, AI / ML engineers, and users of closed and open source data tools and methods / techniques in the Philippines. Data Engineering Pilipinas is a PyData group.
This will serve as a repository of notes, thoughts, ideas, plans, dreams, datasets, analyses, and whatever else we think of.
- [Study Roadmap]
- [Free Study Resources]
- Data Storage & Databases
- Data Ingestion
- Data Formats
- Stream Processing
- Batch Processing
- Workflow Orchestration
- Data Transformation
- Data Governance
- Data Platforms
- Community Contents
- PostgreSQL - is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
- MySQL - the most popular Open Source SQL database management system, is developed, distributed, and supported by Oracle Corporation.
- Amazon Relational Database Service (RDS) - is a collection of managed services that makes it simple to set up, operate, and scale databases in the cloud. Choose from seven popular engines: Amazon Aurora with MySQL compatibility, Amazon Aurora with PostgreSQL compatibility, MySQL, MariaDB, PostgreSQL, Oracle, and SQL Server.
- Amazon Redshift - Store, analyze, and process large amounts of data. Cloud Data Warehouse. PostgreSQL backend. MPP Engine and architecture. Available in Provisioned or Serverless.
- Google BigQuery - is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.
- Redis - is an open source (BSD licensed), in-memory key-value cache, message broker, and streaming engine.
- Amazon DynamoDB - is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale.
- Amazon S3 - is an object storage service offering industry-leading scalability, data availability, security, and performance.
- Azure Blob Storage - massively scalable and secure object storage for cloud-native workloads, archives, data lakes, high-performance computing, and machine learning.
- Google Cloud Storage - is a managed service for storing unstructured data. Store any amount of data and retrieve it as often as you like.
- Apache Kafka - a distributed event streaming platform.
- Apache Kafka (open-source)
- Apache Kafka (Confluent) - a fully managed Apache Kafka service that offers support from Kafka committer-led experts, a 99.99% uptime SLA, and more. Apache Kafka on Confluent is cloud-native.
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) - is a fully managed Kafka service that operates, maintains, and scales Apache Kafka clusters, provides enterprise-grade security features out of the box, and has built-in AWS integrations that accelerate development of streaming data applications. Apache Kafka on AWS is cloud-hosted.
- AWS SDK for pandas (formerly AWS Data Wrangler) - an open source Python initiative that extends the power of the pandas library to AWS, connecting DataFrames and AWS data & analytics services.
- Amazon Kinesis - a fully managed, cloud-based service for real-time data processing over large, distributed data streams.
- Airbyte - A data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes.
- Pentaho Data Integration (Kettle) - consists of a core data integration (ETL) engine, and GUI applications that allow the user to define data integration jobs and transformations.
- Apache Avro - is the leading serialization format for record data, and first choice for streaming data pipelines.
- Apache Parquet - is an open source, column-oriented data file format designed for efficient data storage and retrieval.
- Apache ORC - the smallest, fastest columnar storage for Hadoop workloads.
- Delta Lake - is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python. Led by Databricks.
- Apache Iceberg - is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive, and Impala using a high-performance table format that works just like a SQL table. Originally developed at Netflix.
- Apache Hudi - (pronounced Hoodie), stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage). Developed by Uber.
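To make the "column-oriented" idea behind formats like Parquet and ORC more concrete, here is a minimal sketch in plain Python. It uses ordinary lists and dicts as stand-ins for file layouts, and the sample records and the `to_columns` helper are made up for illustration; real columnar formats add encoding, compression, and metadata on top of this idea.

```python
# Illustrative only: in-memory structures standing in for on-disk layouts.
rows = [
    {"order_id": 1, "city": "Manila", "amount": 250.0},
    {"order_id": 2, "city": "Cebu", "amount": 100.0},
    {"order_id": 3, "city": "Davao", "amount": 175.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented: every full record is visited just to total one field.
total_from_rows = sum(record["amount"] for record in rows)

# Column-oriented: only the "amount" column is touched.
total_from_columns = sum(columns["amount"])
```

Both totals are the same; the difference is how much data has to be read to compute them, which is why analytical queries over a few columns tend to favor columnar files.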
- Apache Spark - is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
- Polars (Python) - is a lightning fast DataFrame library/in-memory query engine.
- Dask (Python) - is a flexible library for parallel computing in Python.
- Presto - is a distributed SQL query engine for big data that allows you to run SQL queries against various data sources.
- Apache Hive - is built on top of Apache Hadoop. A distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.
- Apache Drill - is an Apache open-source SQL query engine for Big Data exploration.
- Trino - is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
- AWS Elastic MapReduce (EMR) - is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.
- AWS Glue - is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development.
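- Spark Streaming (DStreams) - an extension of the core Spark API for processing live data streams. Now a legacy API that no longer receives updates; Structured Streaming, introduced in Spark 2.0, is the recommended replacement.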
- Spark Structured Streaming (DataFrames) - is a stream processing engine built on the Spark SQL engine.
- Apache Flink - is a framework and distributed processing engine for stateful computations over Data Streams
- Apache Storm - is a free and open source distributed realtime computation system. Doing for realtime processing what Hadoop did for batch processing.
- Apache Druid - is a high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.
- Apache Pinot - realtime distributed OLAP datastore, designed to answer OLAP queries with low latency
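Many of the stream processing engines above keep state while events flow through them, for example counts per time window. The sketch below shows a tumbling (fixed, non-overlapping) window count in plain Python. The function name, event shape, and sample data are assumptions made for illustration, and it ignores concerns that real engines handle, such as late or out-of-order events and fault-tolerant state.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start: {key: count}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Map each timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start][key] += 1
    return {start: dict(per_key) for start, per_key in counts.items()}


# Hypothetical clickstream: (seconds since start, event type).
events = [(0, "click"), (3, "view"), (9, "click"), (10, "click"), (14, "view"), (21, "click")]
result = tumbling_window_counts(events, window_seconds=10)
```

With 10-second windows, the events at 0, 3, and 9 seconds land in the window starting at 0, those at 10 and 14 in the window starting at 10, and the one at 21 in the window starting at 20.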
- Apache Airflow - is an open-source workflow management platform for data engineering pipelines. Built by Airbnb.
- Mage - Open-source data pipeline tool for transforming and integrating data. Positions itself as a modern replacement for Airflow.
- Dagster - An orchestration platform for the development, production, and observation of data assets.
- Prefect - is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines.
- Kestra - is a universal open-source orchestrator that makes both scheduled and event-driven workflows easy
- AWS Step Functions - is a fully managed service that makes it easier to coordinate the components of distributed applications and microservices using visual workflows.
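At their core, workflow orchestrators run tasks in an order that respects their dependencies, usually modeled as a directed acyclic graph (DAG). Here is a rough sketch of that one idea using Python's standard `graphlib` module (Python 3.9+). The task names and the `execution_order` helper are hypothetical, and this is not how any of the tools above are implemented; they add scheduling, retries, logging, and much more.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_orders_customers"},
}


def execution_order(dag):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
```

The two extract tasks do not depend on each other, so either may come first; the join always comes after both, and the load always comes last. If the graph contains a cycle, `static_order()` raises `graphlib.CycleError`.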
- Data Build Tool (dbt) - is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.
- SQLMesh - is an open source data transformation framework that brings the best practices of DevOps to data teams. It enables data scientists, analysts, and engineers to efficiently run and deploy data transformations written in SQL or Python.
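The "SQL-first transformation" idea can be sketched with Python's built-in `sqlite3` module: a model is a name plus a `SELECT` statement, and running the model materializes the result as a table. The table names, sample data, and the `models` dictionary below are assumptions for illustration only; this is not the dbt or SQLMesh API, and those tools add dependency resolution, testing, environments, and documentation on top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load step: raw data lands in the database untransformed.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "Manila", 250.0), (2, "Cebu", 100.0), (3, "Manila", 50.0)],
)

# Hypothetical "models": each name maps to the SELECT that defines it.
models = {
    "city_revenue": "SELECT city, SUM(amount) AS revenue FROM raw_orders GROUP BY city",
}

# Transform step: materialize every model as a table inside the database.
for name, select_sql in models.items():
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} AS {select_sql}")

city_revenue = conn.execute(
    "SELECT city, revenue FROM city_revenue ORDER BY city"
).fetchall()
```

Because the transformation is plain SQL kept in code, it can be version-controlled and reviewed like any other software change.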
- DataHub Project - is an extensible metadata platform that enables data discovery, data observability, and federated governance to help tame the complexity of your data ecosystem. Available as open source and as a managed offering. Built by LinkedIn.
- OpenMetadata - A Single Place to Discover, Collaborate and get your Data Right. Open-source. Inspired by Uber's metadata platform.
- Apache Atlas - is an open-source metadata and big data governance framework which helps data users collaborate on their data assets. Open-source. Incubated by Hortonworks.
- Amundsen - Open source data discovery and metadata engine. Created by Lyft.
- Great Expectations - a platform for Data Quality.
- Open-source - is a Python library that provides a framework for describing the acceptable state of data and then validating that the data meets those criteria.
- Cloud (SaaS) -
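The data quality pattern described above, stating what acceptable data looks like and then validating data against it, can be sketched in a few lines of plain Python. Everything here (`expect_not_null`, `expect_between`, `validate`, and the sample rows) is a made-up illustration of the pattern and is not the Great Expectations API.

```python
def expect_not_null(column):
    """Build a check that passes when `column` is present and not None."""
    def check(row):
        return row.get(column) is not None
    check.__name__ = f"expect_not_null({column})"
    return check


def expect_between(column, low, high):
    """Build a check that passes when `column` lies within [low, high]."""
    def check(row):
        value = row.get(column)
        return value is not None and low <= value <= high
    check.__name__ = f"expect_between({column}, {low}, {high})"
    return check


def validate(rows, expectations):
    """Return {expectation_name: [indexes of rows that fail it]}."""
    return {
        check.__name__: [i for i, row in enumerate(rows) if not check(row)]
        for check in expectations
    }


# Hypothetical records with one missing ID and one implausible age.
rows = [
    {"user_id": 1, "age": 29},
    {"user_id": None, "age": 41},
    {"user_id": 3, "age": 230},
]
report = validate(rows, [expect_not_null("user_id"), expect_between("age", 0, 120)])
```

The report lists the failing row indexes per expectation, so a pipeline can decide whether to stop, quarantine the bad rows, or just raise an alert.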
- Databricks - Founded by the creators of Apache Spark. Combines the data warehouse and the data lake into a single Lakehouse platform. Unified. Open. Scalable. Try it free for 14 days. We suggest choosing AWS as the cloud platform.
- Snowflake - The Data Cloud. Cloud-native Data Warehouse Platform. Consists of Cloud Services Layer, Compute Layer, and Data Storage Layer. Try it free for 30 days.
Work in progress
Description:
This video provides an introduction to Data Engineering. Presented in partnership with StudevPH, with guest speaker Josh Dev.
Link: https://www.facebook.com/studevph/videos/165090273259790
Description:
This video discusses building a career in PH Tech Startups.
- In Partnership with Filipino Web Development Peers, Hosted by FWDP Founder, David Genesis Pedeglorio
- Guest Speaker: Andoy Montiel, Chief Data Officer of Packworks
Link: https://youtu.be/pzxFTFB8f6s
Description:
This video discusses careers in Analytics in the Philippines.
- Guest Speaker: Sherwin Pelayo, hosted by Doc Ligot
Link: https://www.youtube.com/watch?v=_CjsYi9ivlc
Description:
A FREE online event featuring content creators and thought leaders in the tech field:
- JP "Sir JP" Lazro & Rhea Alum, StudevPH
- Seiji Villafranca, Angular PH
- David Genesis Pedeglorio & Renzo Marl Peralta, Filipino Web Development Peers
- Josh "Josh Dev" Valdeleon, Data Engineering Pilipinas
- Hosted by Kuya Dev and Doc Ligot
Link: https://www.facebook.com/watch/live/?ref=watch_permalink&v=1043806640102969
- Kyle Escosia - A Data Engineer who is passionately curious about anything related to data.