This repository contains Cloud Data Engineering training materials developed by Myla Ram Reddy.
Please contact Renuka for training and the DP-203: Data Engineering on Microsoft Azure exam @ 8374899166 (WhatsApp).
- Install Anaconda
- understand the Markdown language
- How to write Python code in plain Notepad
- How to write Python code in Spyder
- How to write Python code in Visual Studio Code
- How to write Python code in Jupyter / JupyterLab
- Different Python Objects
- int
- float
- complex
- str
- bool
- range
- Data Structures
- list
- Dict
- Tuple
- Set
- Mutable Vs Immutable
- Read items of str / list / Dict / Tuple / Set / range, etc. (see the sketch after this list)
- index
- slice
- fancy
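A minimal Python sketch of the indexing, slicing, and (NumPy) fancy-indexing topics above; the sample values are illustrative only:

```python
import numpy as np

s = "data engineering"
items = [10, 20, 30, 40, 50]

print(s[0], items[-1])      # indexing: first character, last element
print(s[0:4], items[1:4])   # slicing: 'data', [20, 30, 40]

# "Fancy" indexing (NumPy): pick arbitrary positions at once
arr = np.array(items)
print(arr[[0, 2, 4]])       # [10 30 50]
```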
- Operators (see the sketch after this list)
- Comparison (>, <, >=, <=, ...)
- Logical/bool (and/or/not)
- NumPy logical (logical_and/logical_or/logical_not)
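A small illustration of comparison, Python logical, and NumPy logical operators (sample values only):

```python
import numpy as np

a, b = 10, 20
print(a > b, a <= b)                 # comparison operators -> False True

# Python logical operators work on single booleans
print(a > 5 and b > 15, not a > b)   # True True

# NumPy logical functions work element-wise on arrays
x = np.array([1, 5, 10])
print(np.logical_and(x > 2, x < 8))  # [False  True False]
```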
- Control Flows (see the sketch after this list)
- input
- if elif elif ... else
- while loop
- break
- continue
- for loop
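A short sketch combining input(), if/elif/else, a while loop with break, and a for loop with continue:

```python
# if / elif / else driven by input()
marks = int(input("Enter marks: "))

if marks >= 75:
    print("Distinction")
elif marks >= 50:
    print("Pass")
else:
    print("Fail")

n = 0
while True:          # while loop stopped with break
    n += 1
    if n == 3:
        break

for i in range(5):   # for loop skipping even numbers with continue
    if i % 2 == 0:
        continue
    print(i)         # prints 1, 3
```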
- System_Defined_Functions
- create functions
- function parameters (see the sketch after this list)
- mandatory parameters
- optional parameters
- flexible parameters
- key-value flexible parameters
- LEGB_scope_of_objects_of_functions
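A minimal sketch of mandatory, optional, flexible (*args), and key-value flexible (**kwargs) parameters; the function and argument names are made up for illustration:

```python
def order_report(customer, city="Hyderabad", *items, **charges):
    """customer is mandatory, city is optional,
    *items collects flexible positional values, **charges collects key-value values."""
    total = sum(charges.values())
    print(customer, city, items, total)

order_report("Ravi")                                   # only the mandatory parameter
order_report("Anu", "Bengaluru", "laptop", "mouse",    # flexible positional values
             shipping=150, gift_wrap=50)               # flexible key-value values
```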
- Methods
- Modules
- User_defined_packages
- system_defined_packages
- Iterables & Iterators
- Lambda_Functions
- Syntax Errors and Exceptions
- List comprehensions (example below)
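A quick illustration of a lambda function and a list comprehension:

```python
nums = [1, 2, 3, 4, 5]

# Lambda function: a small anonymous function
square = lambda x: x * x
print(square(4))                        # 16

# List comprehension: build a list in one expression
even_squares = [x * x for x in nums if x % 2 == 0]
print(even_squares)                     # [4, 16]
```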
- OOPs_Introduction_Classes_Objects_Attributes_Methods
- OOPs_Inheritance_and_MRO
- OOPs_Encapsulation
- OOPs_Polymorphism
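A minimal sketch tying the OOP topics above together (classes, attributes, methods, inheritance/MRO, encapsulation, polymorphism); the class names are illustrative:

```python
class Employee:                       # class with attributes and a method
    def __init__(self, name, salary):
        self.name = name
        self._salary = salary         # encapsulation: "protected" by convention

    def pay(self):
        return self._salary

class Manager(Employee):              # inheritance; MRO is Manager -> Employee -> object
    def pay(self):                    # polymorphism: overrides Employee.pay
        return super().pay() + 10000

for emp in (Employee("Ravi", 50000), Manager("Anu", 70000)):
    print(emp.name, emp.pay())        # same call, different behaviour

print(Manager.__mro__)                # method resolution order
```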
- BigData Introduction
- What is BigData
- BigData properties
- When to choose BigData
- BigData VM Installation
- Oracle VirtualBox installation
- Cloudera VM installation
- WinSCP Installation
- PuTTY Installation
- Linux commands
- Working with folders
- create folder
- remove folder with files
- remove folder without files
- understanding VI editor
- working with Files
- create a file
- copy file
- move file
- remove file
- cat command
- understanding permissions
- grep command
- find command
- ... etc
- HDFS
- mkdir command
- put command
- get command
- copyFromLocal command
- copyToLocal command
- rm Command
- merge command
- ... etc
- Hive
- Hive Metastore
- Hive Managed Tables
- Hive External Tables
- Hive Operations
- Hadoop file formats and their types
- Different ways to connect to Hive
- Partitioning
- Bucketing
- Sqoop
- Sqoop Introduction
- sqoop list-tables
- Sqoop Eval
- Sqoop Import
- Sqoop Export
- Import All Tables
- Import table from MySQL to Hive
- PySpark (see the sketch after this list)
- Spark Introduction
- Spark Architecture
- Spark Environment Setup (optional)
- Spark RDD with Python
- Spark RDD with Scala
- Spark DF
- Spark SQL
- Spark Structured Streaming
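A minimal PySpark sketch showing a SparkSession, a small DataFrame, and the same data queried through Spark SQL (sample data only):

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists as `spark`; getOrCreate() reuses it
spark = SparkSession.builder.appName("intro").getOrCreate()

df = spark.createDataFrame([("HYD", 100), ("BLR", 250)], ["city", "orders"])

# DataFrame API and Spark SQL are two views of the same engine
df.filter(df.orders > 150).show()

df.createOrReplaceTempView("orders_by_city")
spark.sql("SELECT city, orders FROM orders_by_city ORDER BY orders DESC").show()
```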
- Introduction
- ETL Introduction
- ELT Introduction
- Different ETL Tools
- Azure Data Factory Introduction
- Azure Data Factory - Important Concepts in ADF
- ADF Architecture
- Create Azure Free Account with credit card
- Create Azure Free Account without credit card
- Storage Account (see the Python SDK sketch after this list)
- Introduction
- What is subscription
- What is resource group
- create resource group
- Create Storage Account
- Differences among LRS/GRS/ZRS/GZRS
- Difference between Hot and Cool Tiers
- Create Data Lake Gen 2
- Create Containers
- Create Folders
- Upload Files
- Overwrite Files
- Download Files
- Edit Files
- Preview Files in different formats
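A minimal sketch of uploading and downloading a blob with the azure-storage-blob Python SDK; the connection string, container, and file names are placeholders:

```python
from azure.storage.blob import BlobServiceClient

# Connection string from the storage account's Access keys blade (placeholder here)
conn_str = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("raw")          # assumes a container named "raw"

# Upload a local file (overwrite if the blob already exists)
with open("sales.csv", "rb") as data:
    container.upload_blob(name="input/sales.csv", data=data, overwrite=True)

# Download it back
downloaded = container.download_blob("input/sales.csv").readall()
print(len(downloaded), "bytes downloaded")
```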
- Azure SQL Database
- Create SQL Database
- Create SQL Server
- Create Username and password
- Allow Azure resources and selected IPs access
- Create tables and insert data
- Query Tables
- Install SSMS
- Access Azure SQL Database using SSMS
- Linked Service
- Create Linked Service to BLOB
- Create Linked Service to Azure SQL Database
- Create Linked Service to MSFT SQL Server
- Create Linked Service to Batch Account
- .... etc
- Test Linked Service Connection
- Integration Run Times
- What is Integration Run Time
- Types of IRs
- Azure integration runtime
- Self-hosted integration runtime
- Azure-SQL Server Integration Services (SSIS) integration runtime
- Install Self-Hosted IR
- Configuration of Self-Hosted IR
- DataSets
- Create Source Datasets
- Create Sink Datasets
- Preview data
- Create Lookup datasets
- Understand and preview data
- BLOB to BLOB Pipeline
- Create Pipeline
- Map source Dataset
- Map Sink Dataset
- Debug
- Trigger
- Understand output of run steps
- Understand Json log in each step
- Azure Storage Account Integration with ADF
- Copy multiple files from blob to blob
- Filter activity - Dynamic Copy Activity
- Get File Names from Folder Dynamically
- Copy Activity Behavior in ADF
- Copy Activity Performance Tuning in ADF
- Get Count of files from folder in ADF
- Validate copied data between source and sink in ADF
- Azure SQL Database integration with ADF
- Azure SQL Databases - Introduction - Relational databases in Azure
- Overwrite and Append Modes in Copy Activity in ADF
- Incremental Load (see the sketch after this list)
- What is full load
- What is incremental load
- types of incremental loads
- Incrementally load data from Azure SQL Database to Azure Blob storage
- Incrementally load data from multiple tables in SQL Server to a database in Azure SQL Database
- Incrementally copy new and changed files based on LastModifiedDate
- Incrementally copy new files based on time partitioned file name
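A hypothetical watermark-based incremental load sketched in PySpark; the server, table, and column names (e.g. ModifiedDate) are placeholders, and in practice ADF's Lookup/Copy activities or a control table would manage the watermark:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Last successfully loaded watermark; in practice read from a control table or file
last_watermark = "2024-01-01 00:00:00"

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"  # placeholder

# Pull only rows changed since the last watermark (assumes a ModifiedDate column)
changed = (spark.read.format("jdbc")
           .option("url", jdbc_url)
           .option("dbtable", f"(SELECT * FROM dbo.Orders WHERE ModifiedDate > '{last_watermark}') src")
           .option("user", "<user>").option("password", "<password>")
           .load())

# Append the delta to the sink and record the new watermark for the next run
changed.write.mode("append").parquet("/mnt/blob/orders_incremental/")
new_watermark = changed.agg(F.max("ModifiedDate")).first()[0]
```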
- Logic Apps
- Send Succeeded mail of ADF pipeline with run stats
- Send Failed mail of ADF pipeline with error message
- Branching and chaining activities
- Azure Devops
- Create organization
- create project
- create Git main branch
- connect Git to ADF
- create a branch in ADF
- publish ADF work in Git branch
- delete git branches
- understand commit in git
- understand and debug merge conflicts
- DBFS (Databricks File System)
- What is DBFS
- Navigate around DBFS
- Understanding path of DBFS
- Compute (creating clusters)
- what is cluster
- create cluster
- map cluster to notebook
- Workspace (Creating notebooks and working with notebooks)
- Understand workspace
- create folders
- organize content in the workspace
- Spark Introduction
- Spark Architecture
- Creating RDDs (Resilient Distributed Datasets) - see the sketch after this list
- what is RDD
- create RDD
- Query RDD
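A minimal RDD sketch (parallelize, transform, collect); the sample data is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local Python list
nums = sc.parallelize([1, 2, 3, 4, 5])

# Query/transform it: transformations are lazy, actions trigger execution
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.collect())   # [4, 16]
```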
- Creating DataFrame (see the sketch after this list)
- what is DF
- create DF
- add columns to DF
- drop columns from DF
- query required data from DF
- .. etc
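A minimal DataFrame sketch covering create, add/drop columns, and querying required data (sample rows only):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "Ravi", 45000.0), (2, "Anu", 52000.0)],
    ["id", "name", "salary"])

df = df.withColumn("bonus", F.col("salary") * 0.10)                 # add a column
df = df.drop("bonus")                                               # drop a column
df.select("name", "salary").filter(F.col("salary") > 50000).show()  # query required data
```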
- Reading and writing data from semi-structured formats (see the sketch after this list)
- Reading JSON Files SingleLine/ MultiLine / Complex
- Reading XML Files
- Reading CSV / TSV Files
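A sketch of reading single-line/multi-line JSON and CSV/TSV files; the paths are placeholders, and XML needs the external spark-xml package:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semi-structured").getOrCreate()

# JSON: single-line records by default; multiLine=True for pretty-printed/multi-line records
single = spark.read.json("/data/events_singleline.json")
multi  = spark.read.option("multiLine", True).json("/data/events_multiline.json")

# CSV / TSV: header and schema inference are optional
csv = spark.read.option("header", True).option("inferSchema", True).csv("/data/sales.csv")
tsv = spark.read.option("header", True).option("sep", "\t").csv("/data/sales.tsv")

# XML requires the external spark-xml package on the cluster, e.g.:
# spark.read.format("xml").option("rowTag", "record").load("/data/records.xml")
```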
- Reading and writing data from structured formats
- Reading data from MySQL / SQL Server / Oracle, etc. (see the sketch after this list)
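A sketch of reading a relational table over JDBC; the URL, credentials, and table name are placeholders, and the matching JDBC driver JAR must be available on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/<database>")   # placeholder connection details
      .option("dbtable", "customers")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())

df.show()
# Writing back works the same way through df.write.format("jdbc") with equivalent options
```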
- Reading and writing data from Big Data formats (see the sketch after this list)
- Parquet
- ORC
- AVRO
- ... etc
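A sketch of writing and reading Parquet, ORC, and Avro; the paths are placeholders and Avro requires the external spark-avro module:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()
df = spark.read.option("header", True).csv("/data/sales.csv")   # placeholder source

# Parquet and ORC are built in
df.write.mode("overwrite").parquet("/out/sales_parquet")
df.write.mode("overwrite").orc("/out/sales_orc")

# Avro ships as an external module (spark-avro); the format name is "avro"
df.write.format("avro").mode("overwrite").save("/out/sales_avro")

parquet_df = spark.read.parquet("/out/sales_parquet")
parquet_df.show()
```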
- Reading and writing data from AWS S3
- Reading and writing data from Azure Blob
- PySpark Joins
- PySpark Union / UnionAll
- Scopes
- Delta Lake (see the sketch at the end of this list)
- ACID Transactions
- Delta Live Tables
- COPY INTO
- Auto Loader
- Convert Parquet or Iceberg data to Delta Lake
- Scheduling jobs
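A minimal Delta Lake sketch (write, read, and convert existing Parquet data in place); it assumes a Databricks or delta-enabled Spark environment and the paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.read.option("header", True).csv("/data/sales.csv")   # placeholder source

# Write as a Delta table (ACID transactions, time travel); path is illustrative
df.write.format("delta").mode("overwrite").save("/mnt/delta/sales")

# Read it back
sales = spark.read.format("delta").load("/mnt/delta/sales")
sales.show()

# Convert an existing Parquet directory to Delta Lake in place
spark.sql("CONVERT TO DELTA parquet.`/mnt/raw/sales_parquet`")
```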