Data.Engineers.Lunch

Resources from weekly Zoom lunches revolving around data engineering and data engineering-related topics. Hosted by Anant Corporation.

Join Data Engineer's Lunch Weekly at 12 PM EST Every Monday

Watch Data Engineer's Lunches Live and Subscribe to Our YouTube Channel to Keep Up to Date

If you would like to be a guest speaker, you can reach us at solutions@anant.us. If you would like to sponsor Data Engineer's Lunch, please reach us at the email listed.

Check out the Data Engineer's Lunch playlist on YouTube

Table of Contents

Number Jump To Topic YouTube SlideShare
1 Data Engineering Roadmap YouTube SlideShare
2 Common ETL Frameworks YouTube SlideShare
3 Scripting Shell Automation for Data Engineering YouTube SlideShare
4 Airflow for Data Engineering YouTube SlideShare
5 What is a Data Lake YouTube SlideShare
6 Common Data Formats Used In Data Engineering YouTube SlideShare
7 SQL Databases YouTube SlideShare
8 SQL Databases Part 2 YouTube SlideShare
9 Open Source & Cloud Data Catalog YouTube SlideShare
10 NoSQL Databases: Part 1 YouTube SlideShare
11 Apache Spark Companion Technologies MLFlow YouTube SlideShare
12 Introduction to sed for Data Engineering YouTube SlideShare
13 Introduction to Airflow YouTube SlideShare
14 NoSQL Databases: Part 2 CAP Theorem YouTube SlideShare
15 Introduction to Jenkins YouTube SlideShare
16 Introduction to awk for Data Engineering YouTube SlideShare
17 NoSQL Databases: Part 3 Data Store Types YouTube SlideShare
18 Luigi for Scheduling YouTube SlideShare
19 Introduction to jq for Data Engineering YouTube SlideShare
20 DataOps vs. DevOps YouTube SlideShare
21 Python ETL Tools YouTube SlideShare
22 Prometheus YouTube SlideShare
23 Thanos/Cortex YouTube SlideShare
24 Pandas for Data Engineering YouTube SlideShare
25 Airflow and Spark YouTube SlideShare
26 Akka Actors for Data Processing YouTube SlideShare
27 Data Processing with Containers: Docker & Kubernetes Tools for Data Engineering YouTube SlideShare
28 Petl for Data Engineering YouTube SlideShare
29 Introduction to Apache Nifi YouTube SlideShare
30 Databand YouTube SlideShare
31 Migrating from PostgreSQL to Cassandra YouTube SlideShare
32 Converting JSON to CSV YouTube SlideShare
33 Using Spark, Cassandra, and Elasticsearch for Data Processing YouTube SlideShare
34 DBeaver YouTube SlideShare
35 Introduction to Snowflake YouTube SlideShare
36 Amundsen/DSE + Airflow YouTube SlideShare
37 Pipedream: Serverless Integration and Compute Platform YouTube SlideShare
39 Dapr Cloud YouTube SlideShare
40 Streaming Real Time vs Batch for ETL YouTube SlideShare
41 PygramETL YouTube SlideShare
42 Introduction to Databricks YouTube SlideShare
43 Bodo.ai - Karthik Narayanan YouTube
44 Prefect YouTube SlideShare
45 Apache Livy YouTube SlideShare
46 Node.js and API calls YouTube SlideShare
47 Airflow on Kubernetes YouTube SlideShare
48 Veezoo - João Pedro Monteiro YouTube
49 Meltano for Data Engineering YouTube SlideShare
50 Airbyte for Data Engineering YouTube SlideShare
51 Comparison of Managed Airflow Options YouTube
52 JupyterHub/JupyterLab on Kubernetes YouTube SlideShare
53 2021 in Review YouTube
54 dbt and Spark YouTube SlideShare
55 Get Started in Data Engineering YouTube SlideShare
56 Spring Cloud Data Flow with Cassandra YouTube SlideShare
57 StreamSets for Data Engineering YouTube SlideShare
58 InfinyOn YouTube
59 Spark Tasks and Distribution YouTube SlideShare
60 Series - Developing Enterprise Consciousness YouTube SlideShare
61 Kubevirt YouTube SlideShare
63 Building a Cryptocurrency Data Catalogue YouTube SlideShare
64 Processing Real-time Crypto Transactions YouTube
65 JanusGraph on Jupyter - Using Notebooks with Graph YouTube
66 Airflow and Presto YouTube SlideShare
67 Machine Learning - Feature Selection YouTube SlideShare
68 DevOps Fundamentals YouTube SlideShare
69 Great Expectations for Data Engineering SlideShare
70 Apache Iceberg YouTube SlideShare
71 Tools for Cloud Data Engineering YouTube
72 Introduction to Apache Pinot YouTube
74 Table Format Comparison YouTube
75 Real-time change data capture, processing, and ingest into OLTP and OLAP databases YouTube
76 Airflow and Google Dataproc YouTube
77 Apache Arrow Flight SQL: A Universal Standard for High-Performance Data Transfers from Databases YouTube
78 Visualize Data from Cassandra in Superset YouTube
79 The Second 90% of Data Engineering Projects YouTube
80 Apache Spark Resource Managers
81 Reverse ETL Tools for Modern Data Platforms YouTube

  • We cover the data engineering roadmap and the general path, which includes various technologies for programming, scripting/automation, databases, data processing, scheduling, clouds, and infrastructure. We also discuss different guides and resources.

  • We discuss common ETL frameworks and different tools and frameworks for different languages including Python, Java, Scala, .NET, and Node.

  • We discuss a multitude of tools you can use to do scripting and shell automation for data engineering along with different shells, cron, and various command-line tools with resources and examples.

  • Guest speaker Will Angel covers the topic of using Airflow for data engineering. Airflow is a scheduling tool for managing data pipelines.

  • We discuss what data lakes are, why we need them, how we get data in and out, and different implementations of data lakes.

  • We discuss common data formats used in data engineering including text/file and binary formats.

  • We discuss relational concepts including the history of RDBMS, the general need for SQL databases, rules of design, and normalization. We also discuss popular SQL databases and their advantages and disadvantages.

  • We continue our discussion of relational concepts, popular SQL databases, and advantages and disadvantages. We also discuss Cloud Databases and database tools compatible with SQL databases.


  • We discuss NoSQL datastores, specifically, different types of key-value stores.

  • We cover MLflow, a tool by Databricks for managing and cataloging machine learning workflows.

  • We will introduce sed, a stream editor, for data engineering. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).

  • We will cover some resources for getting started with Airflow, a Python-based scheduling tool with the ability to connect to a number of different data management tools. We had an overview recently from Will Angel in Data Engineer's Lunch #4. This session will help beginners learn to use Airflow.

  • We cover the fundamental difference between relational and most non-relational databases: ACID vs. BASE.

  • We will cover the use of Jenkins as a scheduling tool, give a general overview of Jenkins's capabilities, and compare it against Airflow as a scheduling tool.

  • We will introduce and demonstrate awk, a program that you can use to select particular records in a file and perform operations upon them.

  • We discussed the four different types of data stores that underlie NoSQL databases.

  • We discussed Luigi as a scheduling platform alongside our previous discussions of Jenkins and Airflow. Luigi is a Python package that helps you build complex pipelines of batch jobs.

  • We introduce jq and how we can use it for data engineering. jq is a command-line tool like sed for JSON data and can be used to slice, filter, map, and transform structured data.

  • We discuss the definitions of and differences between DataOps (Data Operations) and DevOps (Development Operations).

  • We discuss, compare, and contrast a number of ETL tools for Python.

  • Guest speaker Will Angel covers the topic of using Prometheus for data engineering. Prometheus is a monitoring system & time series database.


  • We continue our discussion of Python ETL tools with a more in-depth look at Pandas.


  • We discuss how to use Akka Actors for concurrent data processing operations.


  • We continue our discussion of Python ETL tools with a more in-depth look at Petl.

  • We introduce Apache Nifi and discuss how we can use it for data engineering.

  • In Data Engineer’s Lunch #30 we discuss the differences between the open-source and paid versions of Databand and have Databand CEO Josh Benamram walk us through a demo of the paid version.

  • In Data Engineer's Lunch #31, we will discuss the process and reasons for migrating your database from SQL (PostgreSQL) to NoSQL (Cassandra).

  • In Data Engineer's Lunch #32, we will discuss different ways to convert JSON files into CSV files.
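
One simple way to do such a conversion can be sketched with Python's standard library alone. This is a minimal sketch and not necessarily an approach shown in the talk: the `json_to_csv` helper and the sample records are hypothetical, and it assumes the input is a JSON array of flat (non-nested) objects.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect every key seen, in first-seen order, so that rows with
    # missing fields still line up under the right column.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample input; the second record has an extra "team" field.
sample = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace", "team": "data"}]'
print(json_to_csv(sample))
```

Nested objects or arrays would need to be flattened first; that step is left out here to keep the sketch short.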

  • In Data Engineer's Lunch #33, we will discuss how you can use Spark and Spark jobs to load data from a CSV file, then save the data to and load it from Cassandra and Elasticsearch.

  • In Data Engineer's Lunch #34: DBeaver, we will be discussing what DBeaver is and how it can be used in data engineering.

  • In Data Engineer's Lunch #35: Introduction to Snowflake, we will introduce Snowflake and discuss how it can be used for Data Engineering.

  • In Data Engineer's Lunch #36, we will discuss data discovery with Amundsen.

  • In Data Engineer's Lunch #37, we will discuss Pipedream, a serverless integration and compute platform that is free for individual developers to use.

  • In Data Engineer's Lunch #39: Dapr Cloud, we will discuss how to use Dapr to make a cloud application.

  • In Data Engineer's Lunch #40: Streaming Real Time vs Batch for ETL, we will be discussing use cases for real-time stream processing versus processing in batches.

  • In Data Engineer's Lunch #41, we will discuss pygrametl as part of our discussion of Python ETL tools.

  • In Data Engineer's Lunch #42, we will introduce Databricks and how it can be used for data engineering.

Data Engineer's Lunch #43: Bodo.ai - Karthik Narayanan

  • In Data Engineer's Lunch #43, Karthik Narayanan, Principal Solutions Architect at Bodo.ai, will demonstrate what Bodo.ai is and what it can do.

  • In Data Engineer's Lunch #44, we will discuss Prefect and how it compares to Airflow when scheduling tasks.

  • In Data Engineer's Lunch #45, we will discuss the use of Apache Livy, which creates a REST API for interacting with Spark.

  • In Data Engineer's Lunch #46, we discuss the architecture of Node.js and use it to initiate and harvest some data from an API call.

  • In Data Engineer's Lunch #47, we will use Kubernetes to deploy Airflow.

Data Engineer's Lunch #48: Veezoo - João Pedro Monteiro

  • In Data Engineer's Lunch #48, João Pedro Monteiro (JP), co-founder and CTO of Veezoo, will be introducing Veezoo and showing how natural language interfaces are the key to enabling data democratization at companies.

  • In Data Engineer's Lunch #49, we will be introducing Meltano and how it can be used for ELT in data engineering.

  • In Data Engineer's Lunch #50, we will introduce Airbyte and discuss how it can be used for data engineering.

Data Engineer's Lunch #51: Comparison of Managed Airflow Options

  • In Data Engineer's Lunch #51: Comparison of Managed Airflow Options, guest speaker Andres Namm will compare AWS Airflow, GCP Airflow, and Astronomer against self-managed Airflow.

Data Engineer's Lunch #52: JupyterHub/JupyterLab on Kubernetes

  • In Data Engineer's Lunch #52, we will deploy JupyterHub/JupyterLab on Kubernetes.

Data Engineer's Lunch #53: 2021 in Review

  • In Data Engineer's Lunch #53, we discussed some of our most popular webinars from 2021 and received feedback from the audience about what they would like to see in 2022.

  • In Data Engineer's Lunch #54, we will discuss dbt (data build tool), a tool for managing data transformations with config files rather than code. We will connect it to Apache Spark and use it to perform transformations.

Data Engineer's Lunch #55: Get Started in Data Engineering

  • In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.

Data Engineer's Lunch #56: Spring Cloud Data Flow with Cassandra

  • In Data Engineer's Lunch #56, we will go over how to integrate Spring Cloud Data Flow with Cassandra.

  • In Data Engineer's Lunch #57, we will discuss StreamSets and how it can be used for data engineering.

Data Engineer's Lunch #58: InfinyOn

  • In Data Engineer’s Lunch #58, Sehyo Chang, founder and CTO of InfinyOn, will give an introduction to Fluvio OSS and the InfinyOn Cloud data streaming platform.

  • In Data Engineer's Lunch #59, we will discuss the way that Spark splits up and distributes work between nodes. We will look at some example code and see in the Spark UI how the work was distributed between nodes.

Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness

  • In Data Engineer's Lunch #60, CEO of Anant, Rahul Singh, will discuss modern data processing and pipeline approaches. The session covers modern data engineering patterns and practices for global data platforms, with a high-level overview of the different types, frameworks, and workflows in data processing and pipeline design.

Data Engineer's Lunch #61: Kubevirt

  • In Data Engineer's Lunch #61, Stefan Nikolovski will discuss Kubevirt.

Data Engineer's Lunch #63: Building a Cryptocurrency Data Catalogue

  • In Data Engineer’s Lunch #63, Travis Collins, founder of the open source project DataPM, will present DataPM and show how to get access to cryptocurrency and blockchain data. This is part 1 of a series with Decodable on processing real-time crypto transactions fed by DataPM.

Data Engineer's Lunch #64: Processing Real-time Crypto Transactions

  • In Data Engineer’s Lunch #64, Eric Sammer, CEO of Decodable, will discuss their cloud-based streaming SQL engine and how to mine insights from data in real-time. This is part 2 of a series with DataPM on processing real-time crypto transactions fed by DataPM.

  • In Data Engineer's Lunch #65, Ryan Quey will discuss using the Graph Notebook tool, put out by the AWS team, with JanusGraph.

  • In Data Engineer's Lunch #66, Arpan Patel will discuss how to connect Airflow and Presto.

Data Engineer's Lunch #67: Machine Learning - Feature Selection

  • In Data Engineer's Lunch #67, Obioma Anomnachi will discuss the process of feature selection as part of a Machine Learning process. Feature selection describes the process of picking particular, relevant data features out of a wider data set, to be used to perform model training.
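
As an illustration of the idea, one of the simplest filter-style techniques is a variance threshold: features that barely change across the data set are dropped before training. The following is a minimal sketch, not necessarily a method presented in the talk; the `select_by_variance` helper, the feature names, and the data are all hypothetical.

```python
from statistics import pvariance


def select_by_variance(rows, feature_names, threshold=0.0):
    """Keep the features whose population variance exceeds the threshold."""
    kept = []
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        if pvariance(column) > threshold:
            kept.append(name)
    return kept


# Hypothetical data set: "constant" never changes, so it carries no signal.
feature_names = ["age", "constant", "income"]
rows = [
    [25, 1, 40_000],
    [32, 1, 52_000],
    [47, 1, 61_000],
]

print(select_by_variance(rows, feature_names))  # ['age', 'income']
```

A variance filter looks at each feature in isolation and ignores the training target, so in practice it is usually only a first pass before other selection methods.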

Data Engineer's Lunch #68: DevOps Fundamentals

  • In Data Engineer’s Lunch #68, Will Angel, Technical Product Manager at Caribou Financial, will provide an introduction to DevOps practices and tooling including testing, deployment automation, logging, monitoring, and DevOps principles. Additionally, we will discuss some of the ways that DevOps for data engineering is different from conventional application development.

Data Engineer's Lunch #69: Great Expectations for Data Engineering

  • In Data Engineer's Lunch #69, Arpan Patel will discuss Great Expectations and how it can be used for data engineering. This will be part one of a series on Great Expectations and will primarily focus on introducing Great Expectations. Future talks will feature tools like Spark and Airflow in conjunction with Great Expectations!

Data Engineer's Lunch #70: Apache Iceberg

  • In Data Engineer's Lunch #70, Alex Merced, Developer Advocate at Dremio, will cover the architectural details of where the Hive table format falls short and how the Iceberg table format resolves those shortcomings, as well as the benefits that stem from Iceberg’s approach.

Data Engineer's Lunch #71: Tools for Cloud Data Engineering

  • In Data Engineer’s Lunch #71, CEO of Anant, Rahul Singh, will discuss tools for cloud data engineering!

Data Engineer's Lunch #72: Introduction to Apache Pinot

  • In Data Engineer’s Lunch #72, CEO of Anant, Rahul Singh, will give an overview of the up-and-coming Apache Pinot project that spun out of LinkedIn and is now being supported by StarTree as an enterprise offering. This is the first in a series of talks and workshops on why Pinot is important to the future of real-time data.

Data Engineer's Lunch #74: Table Format Comparison

  • In Data Engineer's Lunch #74, Alex Merced, Developer Advocate for Dremio, will discuss the three major data lake table formats – Apache Iceberg, Apache Hudi, and Delta Lake – covering how they work, their features, and their limitations so you can make an informed decision when architecting your data lakehouse.

Data Engineer's Lunch #75: Real-time change data capture, processing, and ingest into OLTP and OLAP databases

  • In Data Engineer's Lunch #75, Eric Sammer, CEO of Decodable, will discuss real-time change data capture, processing, and ingest into OLTP and OLAP databases!

Data Engineer's Lunch #76: Airflow and Google Dataproc

  • In Data Engineer's Lunch #76, Arpan Patel will cover how to connect Airflow and Dataproc with a demo using an Airflow DAG to create a Dataproc cluster, submit an Apache Spark job to Dataproc, and destroy the Dataproc cluster upon completion.

Data Engineer's Lunch #77: Apache Arrow Flight SQL: A Universal Standard for High-Performance Data Transfers from Databases

  • This talk covers why ODBC & JDBC don’t cut it in today’s data world and the problems solved by Arrow, Arrow Flight, and Arrow Flight SQL. Alex will go through how each of these building blocks works as well as an overview of universal ODBC & JDBC drivers built on Arrow Flight SQL, enabling clients to take advantage of this increased performance with zero application changes.

Data Engineer's Lunch #78: Visualize Data from Cassandra in Superset

  • In this lunch, Ryan will walk through how to visualize data from Cassandra in Superset (by means of Presto). Along the way, he shares some observations about his experience and potential use cases that may be interesting to you.

Data Engineer's Lunch #79: Data Governance: The Second 90% of Data Engineering Projects

  • You build an ELT pipeline to get data from some source, load it into your data lake, and transform it into a usefully modeled dataset for analysts and business users to consume; another data engineering job well done. Except you now have a new set of data artifacts, access patterns, documentation (hopefully), and security permissions to manage. This talk will provide an overview of Data Governance, which is the art of anticipating, preventing, and mitigating all the risks, costs, and headaches that come with every new data source throughout the data lifecycle.

Data Engineer's Lunch #80: Apache Spark Resource Managers

  • In Data Engineer's Lunch #80, Obioma Anomnachi will compare and contrast the different resource managers available for Apache Spark. We will cover local, standalone, YARN, and Kubernetes resource managers and discuss how each one allows the user different levels of control over how resources given to Spark are distributed to Spark applications.

Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms

  • During this lunch, we’ll review some of the open source reverse ETL tools to uncover how to send data back to SaaS systems.