/data-system-design

System Design, Solution Architecture, Data Systems Practice

Data Systems Design and Implementation

(also available in Papers folder)

Can describe the following terms:
- state machine replication within a shard
- two-phase locking for serializable transaction isolation
- two-phase commit for cross-shard atomicity (see the minimal 2PC sketch below)
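
A minimal two-phase commit sketch in Python, assuming a hypothetical in-memory Participant interface (not from any particular system): the coordinator commits only if every participant votes yes in the prepare phase, and aborts everywhere otherwise.

```python
# Minimal two-phase commit (2PC) coordinator sketch -- illustrative only.
# The Participant interface below is hypothetical, not from any specific system.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Participant:
    """One shard / resource manager taking part in the transaction."""
    name: str
    data: Dict[str, str] = field(default_factory=dict)
    staged: Dict[str, str] = field(default_factory=dict)

    def prepare(self, writes: Dict[str, str]) -> bool:
        # Phase 1: stage writes durably and vote yes/no.
        self.staged = dict(writes)
        return True  # vote "yes"; a real participant may vote "no" on conflict

    def commit(self) -> None:
        # Phase 2: make staged writes visible.
        self.data.update(self.staged)
        self.staged = {}

    def abort(self) -> None:
        self.staged = {}


def two_phase_commit(participants: List[Participant],
                     writes_per_participant: Dict[str, Dict[str, str]]) -> bool:
    """Coordinator: commit only if every participant votes yes in phase 1."""
    votes = [p.prepare(writes_per_participant.get(p.name, {})) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


if __name__ == "__main__":
    shards = [Participant("shard-a"), Participant("shard-b")]
    ok = two_phase_commit(shards, {"shard-a": {"k1": "v1"}, "shard-b": {"k2": "v2"}})
    print("committed" if ok else "aborted")
```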

Blogs

Terms

  • Table Formats (Iceberg, Hudi, Delta, OneTable, etc.).

  • Features (ACID transaction, Schema Evolution, Upsert, Time Travel, etc.).

  • Tech (Copy-on-Write vs. Merge-on-Read, Compaction, Vacuum, OCC (Optimistic Concurrency Control), MVCC (Multiversion Concurrency Control)). See the Delta Lake sketch below for an upsert and time-travel example.
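
A minimal sketch of upsert (MERGE, Copy-on-Write) and time travel, assuming PySpark with the delta-spark package configured on the session; the table path and column names are made up for illustration.

```python
# Sketch: Copy-on-Write upsert (MERGE) and time travel with Delta Lake on PySpark.
# Assumes a Spark session configured with the delta-spark package; the path and
# columns below are invented for illustration.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/example_delta_table"  # hypothetical location

# Initial load (version 0 of the table).
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]) \
    .write.format("delta").mode("overwrite").save(path)

# Upsert: MERGE rewrites the affected files (Copy-on-Write) under an ACID commit.
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "val"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```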

Blogs and Videos

RFCs

Blogs


Papers

  • Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (apache/datafusion#6782). Written in Rust; uses Apache Arrow as its in-memory model.

Projects

Metadata management, data quality, data veracity, data security, data lineage, etc.

Blogs

Papers

  • [VLDB, Amazon - Automating Large-Scale Data Quality Verification](https://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). It presents the design choices and architecture of a production-grade system for checking data quality at scale, and shows evaluation results on several datasets.
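
A toy illustration of the declarative-constraint idea the paper describes (this is not Deequ's actual API): each constraint computes a metric over the dataset and compares it to a declared threshold.

```python
# Illustrative sketch of declarative data quality constraints, in the spirit of
# the paper but not using any real library API: each check computes a metric
# over the dataset and compares it against a user-declared threshold.
import pandas as pd


def completeness(df: pd.DataFrame, column: str) -> float:
    """Fraction of non-null values in a column."""
    return df[column].notna().mean()


def uniqueness(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows whose value in the column occurs exactly once."""
    counts = df[column].value_counts()
    return (counts == 1).sum() / len(df)


def run_checks(df: pd.DataFrame, checks) -> list:
    """Evaluate (name, metric_fn, threshold) constraints; return the failures."""
    failures = []
    for name, metric_fn, threshold in checks:
        value = metric_fn(df)
        if value < threshold:
            failures.append((name, value, threshold))
    return failures


if __name__ == "__main__":
    df = pd.DataFrame({"id": [1, 2, 2, 4], "email": ["a@x", None, "c@x", "d@x"]})
    checks = [
        ("email completeness >= 0.9", lambda d: completeness(d, "email"), 0.9),
        ("id uniqueness == 1.0", lambda d: uniqueness(d, "id"), 1.0),
    ]
    print(run_checks(df, checks))
```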

Best Practices

  • Too few data quality alerts let important issues go unresolved.

  • Too many alerts overwhelm the team and can make the most important ones go unnoticed.

  • Statistical modeling techniques (PCA, etc.) can reduce the compute needed for quality checks.

  • Separate anomaly detection from anomaly scoring and the alerting strategy (see the sketch below).
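
A small sketch of the last point, with invented metric names and thresholds: detection flags and scores deviations, while a separate alerting policy decides what becomes a page versus a ticket.

```python
# Sketch: keep anomaly detection separate from scoring/alerting policy.
# Metric names and thresholds are invented for illustration.
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Anomaly:
    metric: str
    value: float
    score: float  # deviation from normal, in standard deviations


def detect(metric: str, history: List[float], latest: float) -> Optional[Anomaly]:
    """Detection: flag the latest point if it deviates from history, with a score."""
    mu, sigma = mean(history), pstdev(history) or 1.0
    score = abs(latest - mu) / sigma
    return Anomaly(metric, latest, score) if score > 1.0 else None


def alert_policy(anomalies: List[Anomaly], page_threshold: float = 4.0,
                 ticket_threshold: float = 2.0) -> dict:
    """Alerting: decide, separately from detection, what is worth paging for."""
    return {
        "page": [a for a in anomalies if a.score >= page_threshold],
        "ticket": [a for a in anomalies if ticket_threshold <= a.score < page_threshold],
    }


if __name__ == "__main__":
    a = detect("row_count", history=[100, 98, 102, 101, 99], latest=60)
    print(alert_policy([a] if a else []))
```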

Common Issues

  • Issues in the metadata category (data availability, data freshness, schema changes, data completeness) → can be detected without inspecting dataset content.

  • Issues in the semantic category (dataset content: column value nullability, duplication, distribution, exceptional values, etc.) → require data profiling.

Blogs

GDPR, CCPA, PII Protection, etc.

Blogs

Kubernetes

Terraform

Network

Observability, Monitoring

Incidents

Architecture

Data Pipelines

Spark and Databricks Compute

Delta Lake

  • [Managing Recalls with Barcode Traceability on the Delta Lake](https://www.databricks.com/blog/managing-recalls-barcode-traceability-delta-lake)

  • [Creating a Spark Streaming ETL pipeline with Delta Lake at Gousto](https://medium.com/gousto-engineering-techbrunch/creating-a-spark-streaming-etl-pipeline-with-delta-lake-at-gousto-6fcbce36eba6)

    • Issues and solutions

      • Costly Spark operation MSCK REPAIR TABLE, because it needs to scan the table's sub-tree in the S3 bucket. → use ALTER TABLE ADD PARTITION instead.

      • Not caching DataFrames that are used multiple times. → use cache().

      • Rewriting the whole destination table, including old partitions, whenever a new partition arrives. → append only the new partition to the destination.

      • Architecture (waiting for CI, Airflow triggering, EMR spin-up, job run, working with the AWS console for logs) slowing down development; minimum feedback loop of 20 minutes. → move away from EMR and adopt a platform that gives complete control over clusters and supports prototyping.

    • Databricks Pros

      • Reduced ETL latency from 2 hours to 15 seconds by using a streaming job and a delta architecture.

      • Spark Structured Streaming Auto Loader helps manage infrastructure (setting up bucket notifications, SNS, and SQS in the background); see the sketch after this list.

      • Notebooks help prototype on and explore production data, and debug interactively with tracebacks and logs; CI/CD then deploys once the code is ready. This reduces the dev cycle from 20 minutes to seconds.

      • Costs remained the same as before Databricks (smaller instances on the streaming cluster compensated for Databricks' higher cost vs. EMR).

      • Reduced complexity in the codebase and deployment (no Airflow).

      • Better ops: performance dashboards, Spark UI, reports.

    • Other topics: dbt for data modeling, Redshift, SSOT (single source of truth).

  • [Data Modeling Best Practices & Implementation on a Modern Lakehouse](https://www.databricks.com/blog/data-modeling-best-practices-implementation-modern-lakehouse)
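
A sketch of the streaming-append pattern referenced in the list above, assuming a Databricks environment where the Auto Loader (cloudFiles) source is available; bucket names and paths are made up.

```python
# Sketch (Databricks-specific): Spark Structured Streaming Auto Loader picking up
# new files from S3 and appending them to a Delta table, instead of rewriting the
# whole destination. Bucket, schema location, and output path are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
    .format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")       # raw event format
    .option("cloudFiles.schemaLocation", "s3://example-bucket/_schemas/events")
    .load("s3://example-bucket/raw/events/"))

(stream.writeStream
    .format("delta")
    .outputMode("append")                      # append new data only, no full rewrite
    .option("checkpointLocation", "s3://example-bucket/_checkpoints/events")
    .trigger(availableNow=True)                # or processingTime="15 seconds" for low latency
    .start("s3://example-bucket/delta/events/"))
```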

Governance

Backfilling

Papers

  • NanoLog: A Nanosecond Scale Logging System. GitHub repo.

    • Implements a fast, low-latency, high-throughput C++ logging system.

      • 10-100x faster than existing systems (Log4j2, spdlog)

      • maintains printf-like semantics

      • 80M messages/sec at a median latency of 8 ns in microbenchmarks, 18 ns in applications

    • How?

      • shifting work out of the runtime hot path and into the compilation and post-execution phases of the application.

      • deferring formatting to an offline process (see the sketch at the end of this section).

    • Benefit: allows detailed logging in low-latency systems.

    • Costs: 512 KB of RAM per thread, one core, and disk bandwidth.
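
A toy Python sketch of NanoLog's core idea of deferring formatting out of the hot path (the real system is C++ and also moves work to compile time): the hot path records only a format-string ID and raw arguments, and an offline step renders the human-readable lines.

```python
# Toy illustration of deferred log formatting in the spirit of NanoLog.
# The real NanoLog is a C++ system with compile-time extraction; this only
# sketches the "record raw args now, format offline later" part.
import time

FORMATS = {}     # format-string ID -> format string (NanoLog builds this at compile time)
LOG_BUFFER = []  # compact log: (timestamp, format ID, raw args)


def register(fmt: str) -> int:
    """Assign an integer ID to a format string."""
    fmt_id = len(FORMATS)
    FORMATS[fmt_id] = fmt
    return fmt_id


def nano_log(fmt_id: int, *args) -> None:
    """Hot path: no string formatting, just record raw values."""
    LOG_BUFFER.append((time.time_ns(), fmt_id, args))


def decompress() -> list:
    """Offline post-processor: render human-readable log lines."""
    return [f"{ts} " + FORMATS[fmt_id] % args for ts, fmt_id, args in LOG_BUFFER]


if __name__ == "__main__":
    REQ_FMT = register("handled request id=%d in %.1f ms")
    nano_log(REQ_FMT, 42, 3.5)
    nano_log(REQ_FMT, 43, 1.2)
    print("\n".join(decompress()))
```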