A collection of useful notes, readings, videos, and code related to data systems design
Replication
Performance
(also available in the Papers folder)
- ZooKeeper: Wait-free coordination for Internet-scale systems
- An Empirical Evaluation of Columnar Storage Formats
- Froid: Optimization of Imperative Programs in a Relational Database (K. Ramachandra, et al., VLDB 2017)
- Don’t Hold My Data Hostage: A Case for Client Protocol Redesign (M. Raasveldt, et al., VLDB 2017)
- An Overview of Query Optimization in Relational Systems (S. Chaudhuri, PODS 1998)
- Dremel: A Decade of Interactive SQL Analysis at Web Scale (S. Melnik, et al., VLDB 2020)
- Photon: A Fast Query Engine for Lakehouse Systems (A. Behm, et al., SIGMOD 2022)
- The Snowflake Elastic Data Warehouse (B. Dageville, et al., SIGMOD 2016)
- DuckDB: an Embeddable Analytical Database (M. Raasveldt, et al., SIGMOD 2019)
- Yellowbrick: An Elastic Data Warehouse on Kubernetes (M. Cusack, et al., CIDR 2024)
- Amazon Redshift Re-Invented (N. Armenatzoglou, et al., SIGMOD 2022)
- SparkFuzz: Searching Correctness Regressions in Modern Query Engines (B. Ghit, et al., DBTest 2020)
Be able to describe the following terms:
- state machine replication within a shard
- two-phase locking for serializable transaction isolation
- two-phase commit for cross-shard atomicity
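To make the last of these concrete, here is a minimal two-phase commit sketch in Python. It is illustrative only (the `Participant` class and transaction payload are made up) and omits what makes real 2PC hard: timeouts, coordinator crash recovery, and write-ahead logging of the prepare records.

```python
class Participant:
    """One shard taking part in a cross-shard transaction (illustrative)."""

    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, txn):
        # Phase 1: validate and durably stage the writes, then vote.
        self.staged = txn
        return True  # vote "yes"; a real participant may vote "no"

    def commit(self):
        # Phase 2: make the staged writes visible.
        print(f"{self.name}: committed {self.staged}")

    def abort(self):
        # Phase 2: discard the staged writes.
        self.staged = None
        print(f"{self.name}: aborted")


def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1 (voting): the transaction may commit only if every shard votes yes.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append(decision)  # the decision point; must be made durable
    # Phase 2 (completion): broadcast the outcome to every shard.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision


shards = [Participant("shard-a"), Participant("shard-b")]
print(two_phase_commit([], shards, {"debit": 10, "credit": 10}))
```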
Blogs
- Grab, Enabling near real-time data analytics on the data lake (https://engineering.grab.com/enabling-near-realtime-data-analytics)
Terms
- Table Formats (Iceberg, Hudi, Delta, OneTable, etc.)
- Features (ACID transactions, Schema Evolution, Upsert, Time Travel, etc.)
- Tech (Copy-on-Write vs Merge-on-Read, Compaction, Vacuum, OCC (Optimistic Concurrency Control), MVCC (Multiversion Concurrency Control))
Blogs and Videos
- Uber
- Redpanda
- Substrate
- AWS
RFCs
Blogs
- Traveloka - Data Lake API on Microservice Architecture using BigQuery.
- Best practice: avoid giving direct access to data platform storage (object storage, database, etc.), as it creates tight coupling to the underlying technology, format, etc. Instead, put an API layer in between to decouple that dependency (see the sketch after this list).
- What’s bad about direct access?
- Change coordination is required between teams.
- Lack of access control (column and row levels).
- Lack of an audit log (who accessed or downloaded what).
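A minimal sketch of such an API layer, using only the Python standard library; the `/v1/orders/` endpoint and the `fetch_orders` backend are hypothetical stand-ins. Consumers depend on the endpoint contract, not on tables or files, so the storage backend can change freely, and access control and audit logging get a single enforcement point.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_orders(customer_id):
    # Hypothetical backend: swap in the real storage (BigQuery, S3, ...)
    # without changing the API contract consumers depend on.
    return [{"customer_id": customer_id, "order_id": 1, "total": 9.99}]


class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/v1/orders/"):
            customer_id = self.path.rsplit("/", 1)[-1]
            # Audit log: who accessed what (a real service would persist this).
            print(f"audit: {self.client_address[0]} read orders of {customer_id}")
            body = json.dumps(fetch_orders(customer_id)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), DataAPI).serve_forever()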
- Building Criteo API, What We’ve Learned (https://medium.com/criteo-engineering/building-criteo-api-what-weve-learned-b7f3e7b8d270). Key lessons learned after building a new API ecosystem from scratch.
- Idempotency Keys: How PayPal and Stripe Prevent Duplicate Payments (https://medium.com/@sahintalha1/the-way-psps-such-as-paypal-stripe-and-adyen-prevent-duplicate-payment-idempotency-keys-615845c185bf); see the idempotency sketch after this list.
- How We Design Our APIs at Slack (https://slack.engineering/how-we-design-our-apis-at-slack/)
- Grafana - How I write HTTP services in Go after 13 years (https://grafana.com/blog/2024/02/09/how-i-write-http-services-in-go-after-13-years/)
- Introducing DoorDash’s In-House Search Engine (https://doordash.engineering/2024/02/27/introducing-doordashs-in-house-search-engine/)
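A hedged sketch of the idempotency-key pattern from the PayPal/Stripe article above: the client sends one key per logical payment, the server stores the first response under that key and replays it on retries. The in-memory store is illustrative only; a real PSP uses a durable store with TTLs and must also serialize concurrent requests carrying the same key.

```python
import uuid

_responses = {}  # idempotency_key -> first recorded response


def charge(amount_cents, currency, idempotency_key):
    if idempotency_key in _responses:
        # Duplicate request (e.g. a client retry after a timeout):
        # replay the stored response instead of charging again.
        return _responses[idempotency_key]
    result = {"charge_id": str(uuid.uuid4()), "amount": amount_cents,
              "currency": currency, "status": "succeeded"}
    _responses[idempotency_key] = result
    return result


key = str(uuid.uuid4())          # client generates one key per logical payment
first = charge(1000, "USD", key)
retry = charge(1000, "USD", key)  # network retry reuses the same key
assert first == retry             # the customer is charged exactly once
```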
Blogs
- Agoda, How to Design and Maintain a High-Performing Data Pipeline
- Data pipeline scalability: SLA, partitioning, data freshness, resource usage, scheduling, data dependencies, monitoring.
- Data quality: freshness, integrity (uniqueness, e.g. no duplicate keys), completeness (e.g. no empty or NULL values), accuracy (values are not abnormal relative to previous trends, ThirdEye), consistency (source = destination, Quilliup, run when the pipeline completes).
- Ensuring data quality: validating before writing to the destination, testing, monitoring, alerting, responding, automatic Jira ticket creation (see the sketch after this list).
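A minimal PySpark sketch of "validating before writing to the destination"; the `orders` data, column names, and `/tmp/orders` path are illustrative, and a real pipeline would raise alerts (or open Jira tickets) rather than bare asserts.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [(1, "2024-01-01", 9.99), (2, "2024-01-01", 5.50)],
    ["order_id", "dt", "total"],
)

# Integrity: uniqueness, i.e. no duplicate keys.
dups = orders.groupBy("order_id").count().filter("count > 1").count()
assert dups == 0, f"{dups} duplicated order_id values"

# Completeness: no NULLs in required columns.
nulls = orders.filter(F.col("total").isNull()).count()
assert nulls == 0, f"{nulls} rows with NULL total"

# Write to the destination only once the checks pass.
orders.write.mode("append").partitionBy("dt").parquet("/tmp/orders")
```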
- Orchestrating Data/ML Workflows at Scale With Netflix Maestro
- Netflix’s Dataflow: bootstrapping, standardization, and automation of batch data pipelines
- Uber
Papers
- Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (apache/datafusion#6782). Written in Rust; uses Apache Arrow as its memory model.
Projects
- https://github.com/flyteorg/flyte: orchestrator
Metadata management, data quality, data veracity, data security, data lineage, etc.
- Open Metadata (https://open-metadata.org/)
Blogs
- [How Google, Uber, and Amazon Ensure High-Quality Data at Scale](https://medium.com/swlh/how-3-of-the-top-tech-companies-approach-data-quality-79c3146fd959)
- [Uber - Monitoring Data Quality at Scale with Statistical Modeling](https://www.uber.com/en-VN/blog/monitoring-data-quality-at-scale)
- [LinkedIn - Towards data quality management at LinkedIn](https://engineering.linkedin.com/blog/2022/towards-data-quality-management-at-linkedin)
- [Data Quality: Timeseries Anomaly Detection at Scale with Thirdeye](https://medium.com/the-ab-tasty-tech-blog/data-quality-timeseries-anomaly-detection-at-scale-with-thirdeye-468f771154e6)
- How we deal with Data Quality using Circuit Breakers (https://medium.com/@modern-cdo/taming-data-quality-with-circuit-breakers-dbe550d3ca78)
- Lyft - From Big Data to Better Data: Ensuring Data Quality with Verity (https://eng.lyft.com/from-big-data-to-better-data-ensuring-data-quality-with-verity-a996b49343f6)
- Data Quality Automation at Twitter (https://blog.x.com/engineering/en_us/topics/infrastructure/2022/data-quality-automation-at-twitter)
Papers
- [VLDB, Amazon - Automating Large-Scale Data Quality Verification](https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). Presents the design choices and architecture of a production-grade system for checking data quality at scale, and shows evaluation results on several datasets.
Best Practices
- Too few data quality alerts let important issues go unresolved.
- Too many alerts overwhelm and might make the most important ones go unnoticed.
- Statistical modeling techniques (PCA, etc.) can be used to reduce the computational cost of quality checks (see the sketch after this list).
- Separate anomaly detection from anomaly scoring and alerting strategy.
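A sketch of the PCA idea: score each day's data quality metrics by reconstruction error against the principal components of recent history, keeping the alerting policy (here a percentile threshold) separate from the detector. Everything below (the metric layout, k = 2, the 99th-percentile rule, the random "history") is an illustrative assumption, not any company's actual model.

```python
import numpy as np

# Rows = days, cols = metrics (row count, null %, distinct keys, ...).
history = np.random.default_rng(0).normal(size=(60, 4))
today = np.array([0.1, 8.0, -0.2, 0.3])  # e.g. a spike in null %

# "Fit" PCA on history: keep the top-k principal components.
mean = history.mean(axis=0)
_, _, vt = np.linalg.svd(history - mean, full_matrices=False)
components = vt[:2]  # k = 2


def anomaly_score(x):
    # Reconstruction error: distance from the low-dimensional subspace
    # that captures the normal correlation structure of the metrics.
    centered = x - mean
    reconstructed = centered @ components.T @ components
    return float(np.linalg.norm(centered - reconstructed))


# The alerting strategy lives elsewhere: e.g. alert only above the 99th
# percentile of historical scores, keeping alert volume manageable.
threshold = np.percentile([anomaly_score(row) for row in history], 99)
if anomaly_score(today) > threshold:
    print("data quality alert: today's metrics look anomalous")
```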
Common Issues
- Issues in the metadata category (data availability, data freshness, schema changes, data completeness) → can be detected without checking dataset content.
- Issues in the semantic category (dataset content: column value nullability, duplication, distribution, exceptional values, etc.) → require data profiling (see the sketch below).
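A small sketch of that split, using a local directory as a stand-in for a partitioned table (e.g. `.../table/dt=YYYY-MM-DD/`); the layout and names are illustrative. The point is that the first function never opens a data file, while the second must profile the content itself.

```python
import os
from datetime import date


def metadata_checks(table_path):
    # Metadata category: answered from the partition listing alone,
    # without reading any dataset content.
    partitions = sorted(p for p in os.listdir(table_path) if p.startswith("dt="))
    latest = partitions[-1].split("=", 1)[1] if partitions else None
    return {"available": bool(partitions),
            "fresh": latest == date.today().isoformat()}


def semantic_checks(rows):
    # Semantic category: requires scanning/profiling the content.
    ids = [r["id"] for r in rows]
    return {"no_duplicates": len(ids) == len(set(ids)),
            "no_null_ids": all(i is not None for i in ids)}


print(semantic_checks([{"id": 1}, {"id": 1}]))  # {'no_duplicates': False, ...}
```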
Blogs
- Data Lineage at Slack (https://slack.engineering/data-lineage-at-slack/).
- The lineage service exposes endpoints for ingestion and stores data in RDS.
- Ingestion for Airflow DAGs is built into existing DAGs using Airflow callbacks.
- Ingestion for Presto dashboards: audit tables, SQL parsing.
- OpenLineage, an open framework for data lineage collection and analysis (https://openlineage.io/)
- How we compute data lineage at Criteo (https://medium.com/criteo-engineering/how-we-compute-data-lineage-at-criteo-b3f09fc5c577)
- Yelp - Spark Data Lineage (https://engineeringblog.yelp.com/2022/08/spark-data-lineage.html)
- Data Lineage: State-of-the-art and Implementation Challenges (https://medium.com/bliblidotcom-techblog/data-lineage-state-of-the-art-and-implementation-challenges-1ea8dccde9de)
GDPR, CCPA, PII Protection, etc.
- Lyft - A Federated Approach To Providing User Privacy Rights (https://eng.lyft.com/a-federated-approach-to-providing-user-privacy-rights-3d9ab73441d9). Technical strategies for CCPA: implementation of user data export and deletion, with a federated design and central orchestration for exporting/deleting.
- Intuit - 10 lessons learned in operationalizing GDPR at scale (https://medium.com/ssdr-book/10-lessons-learned-in-operationalizing-gdpr-at-scale-7a41318846b6)
Blogs
- How Agoda manages 1.8 trillion Events per day on Kafka (https://medium.com/agoda-engineering/how-agoda-manages-1-8-trillion-events-per-day-on-kafka-1d6c3f4a7ad1)
- Apache Kafka Rebalance Protocol, or the magic behind your streams applications (https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2)
- Featureflow: Democratizing ML for Agoda (https://medium.com/agoda-engineering/featureflow-democratizing-ml-for-agoda-aec7a6c45b30)
- Challenge: time-consuming feature analysis, training, and validation vs. fast-changing customers and competitors in the travel industry; lack of consistency from analysis to training and from feature development to deployment.
- Solution: Featureflow, with components for UI, data pipeline, monitoring, sandbox environment, and experiment platform.
- Result: feature analysis reduced from a week to a day, quarterly experiments increased from 6 to 20, feature contributors grew from ~3 to ~50, a larger feature pool, and a more robust feature screening process.
- How ByteDance Scales Offline Inference with Multi-Modal LLMs to 200 TB Data (https://www.anyscale.com/blog/how-bytedance-scales-offline-inference-with-multi-modal-llms-to-200TB-data)
- Building Real-time Machine Learning Foundations at Lyft (https://eng.lyft.com/building-real-time-machine-learning-foundations-at-lyft-6dd99b385a4e)
Kubernetes
- Lessons From Our 8 Years Of Kubernetes In Production (https://medium.com/@.anders/learnings-from-our-8-years-of-kubernetes-in-production-two-major-cluster-crashes-ditching-self-0257c09d36cd)
Terraform
- Slack - How We Use Terraform At Slack (https://slack.engineering/how-we-use-terraform-at-slack/)
Network
- Slack - Traffic 101: Packets Mostly Flow (https://slack.engineering/traffic-101-packets-mostly-flow/)
- Slack - Continuous Load Testing (https://slack.engineering/continuous-load-testing/)
Observability, Monitoring
- Observability @ Data Pipelines (https://medium.com/ssdr-book/observability-data-pipelines-99eda62b1704)
Incidents
- Slack’s Incident on 2022-Feb-22 (https://slack.engineering/slacks-incident-on-2-22-22/)
- Preparing for Interview at Agoda: the interview process at Agoda, with advice for candidates at each stage.
Architecture
- A data architecture pattern to maximize the value of the Lakehouse (https://www.databricks.com/blog/data-architecture-pattern-maximize-value-lakehouse.html)
Data Pipelines
- How to Evaluate Data Pipelines for Cost to Performance (https://www.databricks.com/blog/2020/11/13/how-to-evaluate-data-pipelines-for-cost-to-performance.html)
Spark and Databricks Compute
- Advanced Topics on Spark Optimization and Debug (https://holdenk.github.io/spark-flowchart)
- Example Code for the High Performance Spark book (https://github.com/high-performance-spark/high-performance-spark-examples)
Delta Lake
- [Managing Recalls with Barcode Traceability on the Delta Lake](https://www.databricks.com/blog/managing-recalls-barcode-traceability-delta-lake)
- [Creating a Spark Streaming ETL pipeline with Delta Lake at Gousto](https://medium.com/gousto-engineering-techbrunch/creating-a-spark-streaming-etl-pipeline-with-delta-lake-at-gousto-6fcbce36eba6)
- Issues and solutions (see the sketch after this list):
- Costly Spark operation MSCK REPAIR TABLE, because it needs to scan the table's whole sub-tree in the S3 bucket → use ALTER TABLE ADD PARTITION instead.
- Not caching dataframes that are used multiple times → use cache.
- Rewriting the whole destination table, including old partitions, whenever there is a new partition → append only the new partition to the destination.
- Architecture (waiting for CI, Airflow triggering, EMR spinning up, job runs, working with the AWS console for logs) slowing down development, with a minimum feedback loop of 20 minutes → move away from EMR and adopt a platform that gives complete control over clusters and prototyping.
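A sketch of the first three fixes in PySpark; the table name, partition values, and bucket paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Instead of `MSCK REPAIR TABLE events` (which scans the table's whole S3
# sub-tree), register just the partition that was written:
spark.sql("""
  ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='2024-01-01')
  LOCATION 's3://bucket/events/dt=2024-01-01/'
""")

# Cache a dataframe that is used multiple times so it is computed once:
df = spark.table("events").where("dt = '2024-01-01'").cache()
good, bad = df.where("status = 'ok'"), df.where("status != 'ok'")

# Write only the new partition instead of rewriting the whole destination
# (dynamic partition overwrite replaces just the partitions being written):
good.write.mode("overwrite") \
    .option("partitionOverwriteMode", "dynamic") \
    .partitionBy("dt").parquet("s3://bucket/dest/")
```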
- Databricks Pros:
- Reduced ETL time/latency from 2 hours to 15 s by using a streaming job and the delta architecture.
- Spark Structured Streaming Autoloader helps manage infra (setting up bucket notifications, SNS, and SQS in the background); see the sketch after this list.
- Notebooks help prototype on and explore production data, and debug interactively with tracebacks and logs; CI/CD then deploys when the code is ready. This reduced the dev cycle from 20 minutes to seconds.
- Costs remained the same as before Databricks (smaller instances on the streaming cluster compensated for Databricks' higher cost vs EMR).
- Reduced complexity in the codebase and deployment (no Airflow).
- Better ops: performance dashboards, Spark UI, reports.
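A hedged sketch of such a streaming ingest with Autoloader (Databricks' `cloudFiles` source); bucket paths, options, and the target table are illustrative, not Gousto's actual pipeline. With `cloudFiles.useNotifications` enabled, Autoloader sets up the bucket notification, SNS, and SQS plumbing itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new files landing in the raw bucket.
stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.useNotifications", "true")   # SNS/SQS managed for you
          .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/orders/")
          .load("s3://bucket/raw/orders/"))

# Continuously append into a Delta table with second-level latency.
(stream.writeStream
       .option("checkpointLocation", "s3://bucket/_checkpoints/orders/")
       .trigger(processingTime="15 seconds")
       .toTable("bronze.orders"))
```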
- Other topics: DBT for data modeling, Redshift, SSOT.
- [Data Modeling Best Practices & Implementation on a Modern Lakehouse](https://www.databricks.com/blog/data-modeling-best-practices-implementation-modern-lakehouse)
Governance
- Implementing the GDPR 'Right to be Forgotten' in Delta Lake (https://www.databricks.com/blog/2022/03/23/implementing-the-gdpr-right-to-be-forgotten-in-delta-lake.html). Approaches: 1) data amnesia, 2) anonymization, 3) pseudonymization/normalized tables. Speed up point DELETEs via data-skipping optimization, using Z-ordering on the fields in the DELETE's WHERE clause (see the sketch below).
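A sketch of that speed-up in Spark SQL on a Delta table (table and column names are illustrative): Z-order by the column that appears in the DELETE's WHERE clause so data skipping prunes most files, and remember that the data only physically leaves storage once VACUUM passes the retention window.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster data files by the column used in point deletes, so data skipping
# can prune most files when locating a single user's rows:
spark.sql("OPTIMIZE users_events ZORDER BY (user_id)")

# The "right to be forgotten" point delete now touches far fewer files:
spark.sql("DELETE FROM users_events WHERE user_id = 'u-123'")

# VACUUM removes the rewritten files after the retention period, which is
# what actually erases the rows from storage:
spark.sql("VACUUM users_events RETAIN 168 HOURS")
```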
Backfilling
- Autoloader start and end date for ingestion (https://community.databricks.com/t5/data-engineering/autoloader-start-and-end-date-for-ingestion/td-p/45523)
Papers
- NanoLog: A Nanosecond Scale Logging System. github repo
- Implements a fast, low-latency, high-throughput C++ logging system.
- 10-100x faster than existing systems (Log4j2, spdlog).
- Maintains printf-like semantics.
- 80M msgs/sec at a median latency of 8 ns in microbenchmarks, 18 ns in apps.
- How?
- Shifting work out of the runtime hot path and into the compilation and post-execution phases of the application.
- Deferring formatting to an offline process (sketched below).
- Benefit: allows detailed logging in low-latency systems.
- Costs: 512 KB of RAM per thread, one core, and disk bandwidth.
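A conceptual Python sketch of that idea (NanoLog itself is C++ and extracts the static format strings at compile time): the hot path appends only a fixed-size binary record of message ID plus raw arguments, and a separate pass applies the format strings offline. The format table and record layout here are invented for illustration.

```python
import struct
import time

FORMATS = {1: "request {} served in {} us"}  # static format strings, known ahead of time
buffer = bytearray()  # stands in for a per-thread staging buffer


def log_fast(msg_id, req_id, micros):
    # Hot path: no string formatting, just a fixed-size binary record.
    buffer.extend(struct.pack("<IQQQ", msg_id, time.time_ns(), req_id, micros))


def decompress():
    # Offline pass: decode the records and apply the format strings.
    for off in range(0, len(buffer), 28):
        msg_id, ts, req_id, micros = struct.unpack_from("<IQQQ", buffer, off)
        print(ts, FORMATS[msg_id].format(req_id, micros))


log_fast(1, req_id=42, micros=7)
decompress()
```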