PySpark

Primary Language: Jupyter Notebook

PySpark SQL

Business Overview

Apache Spark is an open-source distributed processing engine for large-scale data applications. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size. It supports code reuse across many workloads, including batch processing, interactive queries, real-time analytics, machine learning, and graph processing, and provides development APIs in Java, Scala, Python, and R.

Data Pipeline

A data pipeline is a mechanism for moving data from one system to another. The data may be transformed along the way, and it may be processed in real time (streaming) rather than in batches. A pipeline covers everything from harvesting or acquiring data through various methods, to storing the raw data, cleaning, validating, and transforming it into a query-worthy format, displaying KPIs, and orchestrating the whole process.

Project

This project mainly focuses on PySpark SQL, SQL functions, and the various joins available in PySpark SQL.

Tech stack:

➔ Language: Python

➔ Package: PySpark