A collection of useful notes, readings, videos, and code related to data systems design
Replication
Performance
(also available in the Papers folder)
- ZooKeeper: Wait-free coordination for Internet-scale systems
- An Empirical Evaluation of Columnar Storage Formats
- Froid: Optimization of Imperative Programs in a Relational Database (K. Ramachandra, et al., VLDB 2017)
- Don’t Hold My Data Hostage: A Case for Client Protocol Redesign (M. Raasveldt, et al., VLDB 2017)
- An Overview of Query Optimization in Relational Systems (S. Chaudhuri, PODS 1998)
- Dremel: A Decade of Interactive SQL Analysis at Web Scale (S. Melnik, et al., VLDB 2020)
- Photon: A Fast Query Engine for Lakehouse Systems (A. Behm, et al., SIGMOD 2022)
- The Snowflake Elastic Data Warehouse (B. Dageville, et al., SIGMOD 2016)
- DuckDB: an Embeddable Analytical Database (M. Raasveldt, et al., SIGMOD 2019)
- Yellowbrick: An Elastic Data Warehouse on Kubernetes (M. Cusack, et al., CIDR 2024)
- Amazon Redshift Re-Invented (N. Armenatzoglou, et al., SIGMOD 2022)
- SparkFuzz: Searching Correctness Regressions in Modern Query Engines (B. Ghit, et al., DBTest 2020)
Be able to describe the following terms:
- state machine replication within a shard
- two-phase locking for serializable transaction isolation
- two-phase commit for cross-shard atomicity
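To make the last of these concrete, here is a minimal two-phase commit sketch in Python. It is illustrative only (the `Participant` class and transaction payload are made up) and omits what makes real 2PC hard: timeouts, coordinator crash recovery, and write-ahead logging of the prepare records.

```python
class Participant:
    """One shard taking part in a cross-shard transaction (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, txn):
        # Phase 1: validate and durably stage the writes, then vote.
        self.staged = txn
        return True  # vote "yes"; a real participant may vote "no"

    def commit(self):
        # Phase 2: make the staged writes visible.
        print(f"{self.name}: committed {self.staged}")

    def abort(self):
        # Phase 2: discard the staged writes.
        self.staged = None
        print(f"{self.name}: aborted")


def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1 (voting): the transaction may commit only if every shard votes yes.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)  # the decision point; must be made durable
    # Phase 2 (completion): broadcast the outcome to every shard.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision


shards = [Participant("shard-a"), Participant("shard-b")]
print(two_phase_commit([], shards, {"debit": 10, "credit": 10}))
```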
Blogs
- Grab, Enabling near real-time data analytics on the data lake (https://engineering.grab.com/enabling-near-realtime-data-analytics)
Terms
- Table Formats (Iceberg, Hudi, Delta, OneTable, etc.)
- Features (ACID transactions, Schema Evolution, Upsert, Time Travel, etc.)
- Tech (Copy-on-Write vs Merge-on-Read, Compaction, Vacuum, OCC (Optimistic Concurrency Control), MVCC (Multiversion Concurrency Control))
Blogs and Videos
- Uber
- Redpanda
- Substrate
- AWS
RFCs
Blogs
- Traveloka - Data Lake API on Microservice Architecture using BigQuery.
- Best practice: avoid giving direct access to data platform storage (object storage, database, etc.), as it creates tight coupling to the underlying technology, format, etc. Instead, put an API layer in between to decouple that dependency (see the sketch after this list).
- What’s bad about direct access?
- Change coordination is required between teams.
- Lack of access control (column and row levels).
- Lack of an audit log (who accessed or downloaded what).
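A minimal sketch of such an API layer, using only the Python standard library; the `/v1/orders/` endpoint and the `fetch_orders` backend are hypothetical stand-ins. Consumers depend on the endpoint contract, not on tables or files, so the storage backend can change freely, and access control and audit logging get a single enforcement point.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_orders(customer_id):
    # Hypothetical backend: swap in the real storage (BigQuery, S3, ...)
    # without changing the API contract consumers depend on.
    return [{"customer_id": customer_id, "order_id": 1, "total": 9.99}]


class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/v1/orders/"):
            customer_id = self.path.rsplit("/", 1)[-1]
            # Audit log: who accessed what (a real service would persist this).
            print(f"audit: {self.client_address[0]} read orders of {customer_id}")
            body = json.dumps(fetch_orders(customer_id)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), DataAPI).serve_forever()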
- Building Criteo API, What We’ve Learned (https://medium.com/criteo-engineering/building-criteo-api-what-weve-learned-b7f3e7b8d270). Key lessons learned after building a new API ecosystem from scratch.
- Idempotency Keys: How PayPal and Stripe Prevent Duplicate Payments (https://medium.com/@sahintalha1/the-way-psps-such-as-paypal-stripe-and-adyen-prevent-duplicate-payment-idempotency-keys-615845c185bf); see the idempotency sketch after this list.
- How We Design Our APIs at Slack (https://slack.engineering/how-we-design-our-apis-at-slack/)
- Grafana - How I write HTTP services in Go after 13 years (https://grafana.com/blog/2024/02/09/how-i-write-http-services-in-go-after-13-years/)
- Introducing DoorDash’s In-House Search Engine (https://doordash.engineering/2024/02/27/introducing-doordashs-in-house-search-engine/)
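A hedged sketch of the idempotency-key pattern from the PayPal/Stripe article above: the client sends one key per logical payment, the server stores the first response under that key and replays it on retries. The in-memory store is illustrative only; a real PSP uses a durable store with TTLs and must also serialize concurrent requests carrying the same key.

```python
import uuid

_responses = {}  # idempotency_key -> first recorded response


def charge(amount_cents, currency, idempotency_key):
    if idempotency_key in _responses:
        # Duplicate request (e.g. a client retry after a timeout):
        # replay the stored response instead of charging again.
        return _responses[idempotency_key]
    result = {"charge_id": str(uuid.uuid4()), "amount": amount_cents,
              "currency": currency, "status": "succeeded"}
    _responses[idempotency_key] = result
    return result


key = str(uuid.uuid4())          # client generates one key per logical payment
first = charge(1000, "USD", key)
retry = charge(1000, "USD", key)  # network retry reuses the same key
assert first == retry             # the customer is charged exactly once
```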
Blogs
- Agoda, How to Design and Maintain a High-Performing Data Pipeline
- Data pipeline scalability: SLA, partitioning, data freshness, resource usage, scheduling, data dependencies, monitoring.
- Data quality: freshness, integrity (uniqueness, e.g. no duplicate keys), completeness (e.g. no empty or NULL values), accuracy (values are not abnormal relative to previous trends, ThirdEye), consistency (source = destination, Quilliup, run when the pipeline completes).
- Ensuring data quality: validating before writing to the destination, testing, monitoring, alerting, responding, automatic Jira ticket creation (see the sketch after this list).
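A minimal PySpark sketch of "validating before writing to the destination"; the `orders` data, column names, and `/tmp/orders` path are illustrative, and a real pipeline would raise alerts (or open Jira tickets) rather than bare asserts.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [(1, "2024-01-01", 9.99), (2, "2024-01-01", 5.50)],
    ["order_id", "dt", "total"],
)

# Integrity: uniqueness, i.e. no duplicate keys.
dups = orders.groupBy("order_id").count().filter("count > 1").count()
assert dups == 0, f"{dups} duplicated order_id values"

# Completeness: no NULLs in required columns.
nulls = orders.filter(F.col("total").isNull()).count()
assert nulls == 0, f"{nulls} rows with NULL total"

# Write to the destination only once the checks pass.
orders.write.mode("append").partitionBy("dt").parquet("/tmp/orders")
```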
- Orchestrating Data/ML Workflows at Scale With Netflix Maestro
- Netflix’s Dataflow: bootstrapping, standardization, and automation of batch data pipelines
- Uber
Papers
- Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (apache/datafusion#6782). Written in Rust; uses Apache Arrow as its memory model.
Projects
- https://github.com/flyteorg/flyte: orchestrator
Metadata management, data quality, data veracity, data security, data lineage, etc.
- Open Metadata (https://open-metadata.org/)
Blogs
- [How Google, Uber, and Amazon Ensure High-Quality Data at Scale](https://medium.com/swlh/how-3-of-the-top-tech-companies-approach-data-quality-79c3146fd959)
- [Uber - Monitoring Data Quality at Scale with Statistical Modeling](https://www.uber.com/en-VN/blog/monitoring-data-quality-at-scale)
- [LinkedIn - Towards data quality management at LinkedIn](https://engineering.linkedin.com/blog/2022/towards-data-quality-management-at-linkedin)
- [Data Quality: Timeseries Anomaly Detection at Scale with Thirdeye](https://medium.com/the-ab-tasty-tech-blog/data-quality-timeseries-anomaly-detection-at-scale-with-thirdeye-468f771154e6)
- How we deal with Data Quality using Circuit Breakers (https://medium.com/@modern-cdo/taming-data-quality-with-circuit-breakers-dbe550d3ca78)
- Lyft - From Big Data to Better Data: Ensuring Data Quality with Verity (https://eng.lyft.com/from-big-data-to-better-data-ensuring-data-quality-with-verity-a996b49343f6)
- Data Quality Automation at Twitter (https://blog.x.com/engineering/en_us/topics/infrastructure/2022/data-quality-automation-at-twitter)
Papers
- [VLDB, Amazon - Automating Large-Scale Data Quality Verification](https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). Presents the design choices and architecture of a production-grade system for checking data quality at scale, and shows evaluation results on several datasets.
Best Practices
- Too few data quality alerts let important issues go unresolved.
- Too many alerts overwhelm and might make the most important ones go unnoticed.
- Statistical modeling techniques (PCA, etc.) can be used to reduce the computational cost of quality checks (see the sketch after this list).
- Separate anomaly detection from anomaly scoring and alerting strategy.
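A sketch of the PCA idea: score each day's data quality metrics by reconstruction error against the principal components of recent history, keeping the alerting policy (here a percentile threshold) separate from the detector. Everything below (the metric layout, k = 2, the 99th-percentile rule, the random "history") is an illustrative assumption, not any company's actual model.

```python
import numpy as np

# Rows = days, cols = metrics (row count, null %, distinct keys, ...).
history = np.random.default_rng(0).normal(size=(60, 4))
today = np.array([0.1, 8.0, -0.2, 0.3])  # e.g. a spike in null %

# "Fit" PCA on history: keep the top-k principal components.
mean = history.mean(axis=0)
_, _, vt = np.linalg.svd(history - mean, full_matrices=False)
components = vt[:2]  # k = 2


def anomaly_score(x):
    # Reconstruction error: distance from the low-dimensional subspace
    # that captures the normal correlation structure of the metrics.
    centered = x - mean
    reconstructed = centered @ components.T @ components
    return float(np.linalg.norm(centered - reconstructed))


# The alerting strategy lives elsewhere: e.g. alert only above the 99th
# percentile of historical scores, keeping alert volume manageable.
threshold = np.percentile([anomaly_score(row) for row in history], 99)
if anomaly_score(today) > threshold:
    print("data quality alert: today's metrics look anomalous")
```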
Common Issues
- Issues in the metadata category (data availability, data freshness, schema changes, data completeness) → can be detected without checking dataset content.
- Issues in the semantic category (dataset content: column value nullability, duplication, distribution, exceptional values, etc.) → require data profiling (see the sketch below).
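A small sketch of that split, using a local directory as a stand-in for a partitioned table (e.g. `.../table/dt=YYYY-MM-DD/`); the layout and names are illustrative. The point is that the first function never opens a data file, while the second must profile the content itself.

```python
import os
from datetime import date


def metadata_checks(table_path):
    # Metadata category: answered from the partition listing alone,
    # without reading any dataset content.
    partitions = sorted(p for p in os.listdir(table_path) if p.startswith("dt="))
    latest = partitions[-1].split("=", 1)[1] if partitions else None
    return {"available": bool(partitions),
            "fresh": latest == date.today().isoformat()}


def semantic_checks(rows):
    # Semantic category: requires scanning/profiling the content.
    ids = [r["id"] for r in rows]
    return {"no_duplicates": len(ids) == len(set(ids)),
            "no_null_ids": all(i is not None for i in ids)}


print(semantic_checks([{"id": 1}, {"id": 1}]))  # {'no_duplicates': False, ...}
```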
Blogs
- Data Lineage at Slack (https://slack.engineering/data-lineage-at-slack/).
- The lineage service exposes endpoints for ingestion and stores data in RDS.
- Ingestion for Airflow DAGs is built into existing DAGs using Airflow callbacks.
- Ingestion for Presto dashboards: audit tables, SQL parsing.
- OpenLineage, an open framework for data lineage collection and analysis (https://openlineage.io/)
- How we compute data lineage at Criteo (https://medium.com/criteo-engineering/how-we-compute-data-lineage-at-criteo-b3f09fc5c577)
- Yelp - Spark Data Lineage (https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html)
- Data Lineage: State-of-the-art and Implementation Challenges (https://medium.com/bliblidotcom-techblog/data-lineage-state-of-the-art-and-implementation-challenges-1ea8dccde9de)
GDPR, CCPA, PII Protection, etc.
- Lyft - A Federated Approach To Providing User Privacy Rights (https://eng.lyft.com/a-federated-approach-to-providing-user-privacy-rights-3d9ab73441d9). Technical strategies for CCPA: implementation of user data export and deletion, with a federated design and central orchestration for exporting/deleting.
- Intuit - 10 lessons learned in operationalizing GDPR at scale (https://medium.com/ssdr-book/10-lessons-learned-in-operationalizing-gdpr-at-scale-7a41318846b6)
Blogs
- How Agoda manages 1.8 trillion Events per day on Kafka (https://medium.com/agoda-engineering/how-agoda-manages-1-8-trillion-events-per-day-on-kafka-1d6c3f4a7ad1)
- Apache Kafka Rebalance Protocol, or the magic behind your streams applications (https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2)
- Featureflow: Democratizing ML for Agoda (https://medium.com/agoda-engineering/featureflow-democratizing-ml-for-agoda-aec7a6c45b30)
- Challenge: time-consuming feature analysis, training, and validation vs. fast-changing customers and competitors in the travel industry; lack of consistency from analysis to training and from feature development to deployment.
- Solution: Featureflow, with components for UI, data pipeline, monitoring, sandbox environment, and experiment platform.
- Result: feature analysis reduced from a week to a day, quarterly experiments increased from 6 to 20, feature contributors grew from ~3 to ~50, a larger feature pool, and a more robust feature screening process.
- How ByteDance Scales Offline Inference with Multi-Modal LLMs to 200 TB Data (https://www.anyscale.com/blog/how-bytedance-scales-offline-inference-with-multi-modal-llms-to-200TB-data)
- Building Real-time Machine Learning Foundations at Lyft (https://eng.lyft.com/building-real-time-machine-learning-foundations-at-lyft-6dd99b385a4e)
Kubernetes
- Lessons From Our 8 Years Of Kubernetes In Production (https://medium.com/@.anders/learnings-from-our-8-years-of-kubernetes-in-production-two-major-cluster-crashes-ditching-self-0257c09d36cd)
Terraform
- Slack - How We Use Terraform At Slack (https://slack.engineering/how-we-use-terraform-at-slack/)
Network
- Slack - Traffic 101: Packets Mostly Flow (https://slack.engineering/traffic-101-packets-mostly-flow/)
- Slack - Continuous Load Testing (https://slack.engineering/continuous-load-testing/)
Observability, Monitoring
- Observability @ Data Pipelines (https://medium.com/ssdr-book/observability-data-pipelines-99eda62b1704)
Incidents
- Slack’s Incident on 2022-Feb-22 (https://slack.engineering/slacks-incident-on-2-22-22/)
- Preparing for Interview at Agoda: the interview process at Agoda, with advice for candidates at each stage.
Architecture
- A data architecture pattern to maximize the value of the Lakehouse (https://www.databricks.com/blog/data-architecture-pattern-maximize-value-lakehouse.html)
Data Pipelines
- How to Evaluate Data Pipelines for Cost to Performance (https://www.databricks.com/blog/2020/11/13/how-to-evaluate-data-pipelines-for-cost-to-performance.html)
Spark and Databricks Compute
- Advanced Topics on Spark Optimization and Debug (https://holdenk.github.io/spark-flowchart)
- Example Code for the High Performance Spark book (https://github.com/high-performance-spark/high-performance-spark-examples)
Delta Lake
- [Managing Recalls with Barcode Traceability on the Delta Lake](https://www.databricks.com/blog/managing-recalls-barcode-traceability-delta-lake)
- [Creating a Spark Streaming ETL pipeline with Delta Lake at Gousto](https://medium.com/gousto-engineering-techbrunch/creating-a-spark-streaming-etl-pipeline-with-delta-lake-at-gousto-6fcbce36eba6)
- Issues and solutions (see the sketch after this list):
- Costly Spark operation MSCK REPAIR TABLE, because it needs to scan the table's whole sub-tree in the S3 bucket → use ALTER TABLE ADD PARTITION instead.
- Not caching dataframes that are used multiple times → use cache.
- Rewriting the whole destination table, including old partitions, whenever there is a new partition → append only the new partition to the destination.
- Architecture (waiting for CI, Airflow triggering, EMR spinning up, job runs, working with the AWS console for logs) slowing down development, with a minimum feedback loop of 20 minutes → move away from EMR and adopt a platform that gives complete control over clusters and prototyping.
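A sketch of the first three fixes in PySpark; the table name, partition values, and bucket paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Instead of `MSCK REPAIR TABLE events` (which scans the table's whole S3
# sub-tree), register just the partition that was written:
spark.sql("""
  ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2024-01-01')
  LOCATION 's3://bucket/events/dt=2024-01-01/'
""")

# Cache a dataframe that is used multiple times so it is computed once:
df = spark.table("events").where("dt = '2024-01-01'").cache()
good, bad = df.where("status = 'ok'"), df.where("status != 'ok'")

# Write only the new partition instead of rewriting the whole destination
# (dynamic partition overwrite replaces just the partitions being written):
good.write.mode("overwrite") \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("dt").parquet("s3://bucket/dest/")
```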
- Databricks Pros:
- Reduced ETL time/latency from 2 hours to 15 s by using a streaming job and the delta architecture.
- Spark Structured Streaming Autoloader helps manage infra (setting up bucket notifications, SNS, and SQS in the background); see the sketch after this list.
- Notebooks help prototype on and explore production data, and debug interactively with tracebacks and logs; CI/CD then deploys when the code is ready. This reduced the dev cycle from 20 minutes to seconds.
- Costs remained the same as before Databricks (smaller instances on the streaming cluster compensated for Databricks' higher cost vs EMR).
- Reduced complexity in the codebase and deployment (no Airflow).
- Better ops: performance dashboards, Spark UI, reports.
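A hedged sketch of such a streaming ingest with Autoloader (Databricks' `cloudFiles` source); bucket paths, options, and the target table are illustrative, not Gousto's actual pipeline. With `cloudFiles.useNotifications` enabled, Autoloader sets up the bucket notification, SNS, and SQS plumbing itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new files landing in the raw bucket.
stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.useNotifications", "true")   # SNS/SQS managed for you
          .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/orders/")
          .load("s3://bucket/raw/orders/"))

# Continuously append into a Delta table with second-level latency.
(stream.writeStream
       .option("checkpointLocation", "s3://bucket/_checkpoints/orders/")
       .trigger(processingTime="15 seconds")
       .toTable("bronze.orders"))
```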
- Other topics: DBT for data modeling, Redshift, SSOT.
- [Data Modeling Best Practices & Implementation on a Modern Lakehouse](https://www.databricks.com/blog/data-modeling-best-practices-implementation-modern-lakehouse)
Governance
- Implementing the GDPR 'Right to be Forgotten' in Delta Lake (https://www.databricks.com/blog/2022/03/23/implementing-the-gdpr-right-to-be-forgotten-in-delta-lake.html). Approaches: 1) data amnesia, 2) anonymization, 3) pseudonymization/normalized tables. Speed up point DELETEs via data-skipping optimization, using Z-ordering on the fields in the DELETE's WHERE clause (see the sketch below).
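A sketch of that speed-up in Spark SQL on a Delta table (table and column names are illustrative): Z-order by the column that appears in the DELETE's WHERE clause so data skipping prunes most files, and remember that the data only physically leaves storage once VACUUM passes the retention window.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster data files by the column used in point deletes, so data skipping
# can prune most files when locating a single user's rows:
spark.sql("OPTIMIZE users_events ZORDER BY (user_id)")

# The "right to be forgotten" point delete now touches far fewer files:
spark.sql("DELETE FROM users_events WHERE user_id = 'u-123'")

# VACUUM removes the rewritten files after the retention period, which is
# what actually erases the rows from storage:
spark.sql("VACUUM users_events RETAIN 168 HOURS")
```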
Backfilling
- Autoloader start and end date for ingestion (https://community.databricks.com/t5/data-engineering/autoloader-start-and-end-date-for-ingestion/td-p/45523)
Papers
- NanoLog: A Nanosecond Scale Logging System. github repo
- Implements a fast, low-latency, high-throughput C++ logging system.
- 10-100x faster than existing systems (Log4j2, spdlog).
- Maintains printf-like semantics.
- 80M msgs/sec at a median latency of 8 ns in microbenchmarks, 18 ns in apps.
- How?
- Shifting work out of the runtime hot path and into the compilation and post-execution phases of the application.
- Deferring formatting to an offline process (sketched below).
- Benefit: allows detailed logging in low-latency systems.
- Costs: 512 KB of RAM per thread, one core, and disk bandwidth.
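A conceptual Python sketch of that idea (NanoLog itself is C++ and extracts the static format strings at compile time): the hot path appends only a fixed-size binary record of message ID plus raw arguments, and a separate pass applies the format strings offline. The format table and record layout here are invented for illustration.

```python
import struct
import time

FORMATS = {1: "request {} served in {} us"}  # static format strings, known ahead of time
buffer = bytearray()  # stands in for a per-thread staging buffer


def log_fast(msg_id, req_id, micros):
    # Hot path: no string formatting, just a fixed-size binary record.
    buffer.extend(struct.pack("<IQQQ", msg_id, time.time_ns(), req_id, micros))


def decompress():
    # Offline pass: decode the records and apply the format strings.
    for off in range(0, len(buffer), 28):
        msg_id, ts, req_id, micros = struct.unpack_from("<IQQQ", buffer, off)
        print(ts, FORMATS[msg_id].format(req_id, micros))


log_fast(1, req_id=42, micros=7)
decompress()
```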