- A very normal Data Engineering work
- What can go wrong in Distributed Data Systems
- The Berkeley View on Cloud Computing - Paper
- Dolt is Git for Data
- Everything Around PySpark Pandas UDF
- Architect and build a #machinelearning use case end to end using Amazon SageMaker
- Around Data Discovery or Metadata Management Platforms
- Amazon S3 Object Lambda - Provide Different Views of Data to Multiple Applications
- The Google File System - The Paper
- Toward Better Data Culture From First Principles by Ube
- Getting started with #dataengineering Volume 6
- Getting started with Data Engineering, volume 5
- Getting started with Data Engineering, volume 4
- Getting started with Data Engineering, volume 3
- Getting started with Data Engineering, volume 2
- Getting started with Data Engineering, volume 1
- Apache Airflow 2.0
- Some Interesting essentials while learning Apache Airflow
- Dagster Release 0.10.0 - Everything about Exactly-once, Fault-Tolerant Scheduling - Extremely Important Release
- #getdbt or Data Build Tool interface across all major Data Workflow Management Platforms
- Apache Superset - An #opensource Fully Featured Business Intelligence Application
- The Hop Orchestration Platform, or Apache #Hop (Incubating), aims to facilitate all aspects of data and metadata orchestration
- Apache Iceberg Partitioning is way better than Hive! Hidden Partitioning makes everything easier!
- Trino aka #prestosql is different from Apache Spark SQL - Exclusively designed for Distributed SQL
- Apache Spark is NOT a Map Reduce but an MPP/MPI Engine
- Apache Hudi - Design Principles
- OpenTelemetry specification V1.0
- DataEngg Skills to work with DataScience
- Data Quality, A necessity for Data Driven Projects
- Essential Cloud Skills for Data Engineering
- Open Source Technologies in Data Engineering
- Kubernetes Fundamentals Required as a Data Engineer
- Apache Superset, OSS Business Intelligence for 2021
- #apachekafka as a Database - Summary of both sides, Arguments, Trade-offs & exceptional quotes
- Processing Guarantees in #apachekafka - The best resource
- Change Data Analysis with Debezium and Apache Pinot
- Optimizing Apache Kafka Producers & Consumers
- Redpanda - A NON-JVM Streaming Platform for mission critical workloads
- Apache Hudi - Turn Batch Jobs to Incremental Model | Complete file management on a Data Lake
- Apache Iceberg - an open table format for huge analytic datasets
- Ballista - Distributed computing platform built primarily on Rust and powered by Apache Arrow
- ZooKeeper, a distributed, open-source coordination service for distributed applications
- Apache Iceberg - Partition Evolution, it's simple but it's so amazing
- A Data Engineering Story - The Beginning
- Data Engineering - More towards Data Science or Data Analytics or ...
- Data Engineering Interview Patterns
- Basic Checklists while learning Apache Spark
- #apachespark for Distributed Analytics or #businessintelligence Platform - Worth it or not?
- Apache Beam for Search: An Introduction & Addressing the challenge of the Time Problem
- Nextflow is a Workflow Manager exclusively for #bioinformatics
- #apachespark Project Zen Update - Making PySpark Better
- Design - Exactly Once Delivery & Transactional Messaging in #apachekafka
- Underrated but important skill of a Data Engineer
- Fallacies of Distributed Systems
- As a Data Engineer, some Essentials I did which really helped Data Scientists and the Team
- SQL Database on Kubernetes - Best Practices
- Devtron - An Open Source DevOps on Kubernetes, written in Go
- Most Popular #opensource BI & Data Analytics Platforms
- datapipelines DataFrame API is now available with #apachebeam
- Disaster Recovery for Multi-Region Apache Kafka & Data Consumption using #apacheflink
- Kubernetes Api Structure
- Architecting a Kubernetes Infrastructure
- Exploring Kubernetes Operator Pattern
- Docker is an integral part of the Data Engineering ML pipeline & that makes security extremely essential
- Rack awareness for #apachekafka Streams Proposal
- Machine Learning Workflow
- Dummy Notes On Machine Learning Infrastructure
- Machine Learning Feature Store
- Deploying a #machinelearning model in Production is really HARD but #MLOps can fix that.
- List of #machinelearning & #dataengineering Technologies I will be following in 2021
- MLOps - ZenML #machinelearning with reproducible pipelines
- Streamlit Healthcare Machine Learning Data App
- Dstack AI - An open-source tool to develop data applications with Python
- Adversarial Robustness Toolbox - a Python library for #machinelearning Security
- Biopython is a set of freely available tools for biological computation written in #Python
- Time to Know More about DASK
- DataEngineering vs Machine Learning
- A good #machinelearning Model is only possible with good-quality #data.
- Statistics for #softwareengineer
- Monitoring #machinelearning Applications
- Dagster is a data orchestrator for machine learning, analytics, and ETL - Officially #machinelearning driven
- Short Notes on Open Source #machinelearning Tracking System
- The best example of Randomness is a #machinelearning model in Production.
- Flyte is a declarative, structured, and highly scalable cloud-native workflow orchestration platform for Distributed Machine Learning
- The Snowflake Paper - Core idea is to build an enterprise-ready #datawarehouse solution for the #cloud
- Most important points around Distributed #dataengineering Platform
- Fundamentals of #distributedsystems Scaling - Avoiding Coordination
- Technical Debt in #dataengineering #softwareengineering
- Paper on Wander Join: Online Aggregation via Random Walks - Join problem
- The Delta Lake Paper - High-Performance ACID Table Storage
- Dynamo - AWS Highly Available Key-value Store #distributedsystem
- An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables, A Single SQL for all
- Secure & Robust Machine Learning in #healthcare
- Progress in Medical Science using #deeplearning
- The Amazon Redshift Paper - A fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing #businessintelligence tools
- Advancing #drugdiscovery via Artificial Intelligence
- Apache Calcite is a dynamic data management framework
- Lakehouse - A Paper on new Generation of #datawarehouse technology
- Calvin: Fast Distributed Transactions for Partitioned Database Systems
- Presto or Trino - #SQL on Everything (The Design, Motivation & Performance) #presto
- Design - Exactly Once Delivery & Transactional Messaging in Apache Kafka
- Apache Kafka Paper: Distributed Messaging System for Log Processing
- Paper: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size
- Paper: Ground is an open-source data context service, a system to manage all the information that informs the use of data
- Azure Data Lake Store (ADLS) is a fully-managed, elastic, scalable, and secure file system that supports #hadoop distributed file system (HDFS) and Cosmos semantics
- An LFU (Least Frequently Used) Cache eviction algorithm of O(1) Runtime complexity