# Pyspark-Tutorial

This repository is part of my journey to learn **PySpark**, the Python API for Apache Spark. I explored the fundamentals of distributed data processing using Spark and practiced with real-world data transformation and querying use cases.


## 🔥 PySpark Essentials

This project is a hands-on collection of notebooks, code snippets, and exercises focused on learning Apache Spark with Python (PySpark). It includes my notes and experiments while exploring core Spark concepts, transformations, actions, DataFrame API, and more.


## 🚀 What is PySpark?

PySpark is the Python API for Apache Spark, a powerful open-source distributed computing engine for large-scale data processing and analytics. It lets you write data-processing jobs in plain Python while Spark distributes the work across a cluster.


## 📘 Topics Covered

- ✅ Introduction to Spark & PySpark
- ✅ SparkContext & SparkSession
- ✅ RDDs (Resilient Distributed Datasets)
- ✅ DataFrames & Datasets
- ✅ Transformations vs Actions
- ✅ Reading/Writing: JSON, CSV, Parquet
- ✅ PySpark SQL & Queries
- ✅ GroupBy, Aggregations, Joins
- ✅ Handling Nulls & Missing Data
- ✅ User-Defined Functions (UDFs)
- ✅ Window Functions
- ✅ Data Partitioning & Performance Optimization
- ✅ Intro to MLlib (Optional)

โœ๏ธ How I Learn

I follow a "Learn by Doing" approach. Each notebook contains:

- ✅ Detailed explanations
- 🧪 Hands-on code examples
- 📌 Real-world case studies