DataEngineering101

Yu Long's notes on data engineering

Primary Language: Jupyter Notebook

What is it?

  • A book similar to DataScience101, written as personal notes toward becoming a better machine learning engineer.

  • 50+ notes so far, continuously updated.

Course & Introduction

data_engineering_brief_intro_by_google

data engineer roadmap 2021

awesome data engineering tools

Emerging Architectures for Modern Data Infrastructure 2022

Books

Designing Data-Intensive Applications

Hands On Course

YouTube - hands on data engineering on gcp

Google - data engineering course

DataCamp - Data Engineer with Python

Udemy - Data Engineering on Google Cloud Platform

CSXXX

CS246 Mining Massive Datasets - Notes

Notes on CS246 Mining Massive Datasets

Notes on CS329S Machine Learning System Design

Theorem

Designing Data-Intensive Applications

Chp1 Reliable, Scalable, and Maintainable Data Systems

Chp2 Data Models and Query Languages

Chp3 Data Storage and Retrieval

Database Fundamental

CAP theorem

OLTP vs OLAP (database vs data warehouse)

Dimensional Modeling (Star Schema)

data warehouse, data lake, data mesh, and other buzzwords

Data Processing

Lambda and Kappa Architecture

Computational Framework Survey

Batch

Spark - installation

pyspark 101

RAPIDS for spark

Streaming

data ingestion

streaming framework survey

spark streaming introduction

structured streaming introduction

case study - near-real-time architecture for recommender at LinkedIn

real-time MVP for recommendation from Chip Huyen

Cloud Logging

Pipeline Management (ETL Management)

data pipeline 101 - I - mirroring

data pipeline 101 - II - partition mirroring

data pipeline 101 - II - accumulated mirroring

data pipeline 101 - III - etl, elt

data pipeline 101 - IV - pipeline design - functionality

data pipeline 101 - V - Idempotency

data pipeline 101 - VI - Guard

data pipeline 101 - VII - Checkpoint, Security, Accounts

data pipeline 101 - VIII - etl development

schema-changeable system

Data Governance

data governance

metadata management

Google Cloud Platform

GCP command

GCP data_lake_warehouse

GCP BigQuery

GCP streaming

Google App Engine

Google Kubernetes Engine Introduction

Google Kubernetes Getting Started

VPC

PubSub

IAM

Kubernetes

Kubernetes for the Absolute Beginners - Hands-on

BigData Algorithm

Storage

LSH family

join algorithm

Relational databases

MySQL install and python connector

database wrapper sqlalchemy, pymysql, pyodbc

Basic sql injection

sql hint

sql 101

Non-relational databases

Document

ElasticSearch 101

Graph

Key-Value

redis 101

Wide Column

Workflow Scheduling

airflow 101

other python scheduler

Crawler

Coding stuff

python crawler packages

Web Analysis

web analysis hits

HTML Tags

GraphQL

Cache

Introduction

Data Ingestion & Store

sync tables from database

Primary Key, Index and Partition

CI/CD

Introduction & Circle CI

Data Parsing / Cleaning 101

data cleaning for traffic analysis

Infra101

data science workstation

A/B, MAB Experiment Platform