PySpark Projects


Apache Spark is a unified analytics engine for large-scale data processing.
It offers high-level APIs in Scala, Java, Python and R. It has an optimized engine that supports general computation graphs for data analysis.
What is more, Apache Spark supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
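The snippet below is a minimal sketch of the DataFrame and Spark SQL APIs; the application name, sample rows, and column names are purely illustrative.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Build a small DataFrame in memory and query it with Spark SQL.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```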
A Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.
The main abstraction Spark provides is the resilient distributed dataset (RDD), an immutable collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
It is also possible to ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Last but not least, RDDs automatically recover from node failures.
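As a rough illustration of these ideas, the sketch below parallelizes a local collection into an RDD, caches a derived RDD in memory, and reuses it across two actions; the numbers and partition count are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection into an RDD with 8 partitions.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Persist the derived RDD in memory so both actions below reuse it
# instead of recomputing the map from scratch.
squares = rdd.map(lambda x: x * x).cache()

print(squares.count())  # first action: computes and caches the partitions
print(squares.sum())    # second action: served from the cached partitions
```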
The second abstraction in Spark is shared variables that can be used in parallel operations. Spark supports two types of shared variables: broadcast variables and accumulators.
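A small, hypothetical example of both kinds of shared variables (the lookup table and the counter are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table shipped once to each executor.
country_names = sc.broadcast({"PL": "Poland", "DE": "Germany"})

# Accumulator: executors can only add to it; the driver reads the result.
unknown_codes = sc.accumulator(0)

def resolve(code):
    name = country_names.value.get(code)
    if name is None:
        unknown_codes.add(1)
    return name

codes = sc.parallelize(["PL", "DE", "XX", "PL"])
print(codes.map(resolve).collect())  # ['Poland', 'Germany', None, 'Poland']
print(unknown_codes.value)           # 1
```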
Furthermore, lazy evaluation is a key feature of Apache Spark that improves efficiency and performance.
It refers to the strategy where transformations on distributed datasets are not executed immediately; their execution is delayed until an action is called. When an action is called, Spark inspects the chain of transformations and builds a DAG (directed acyclic graph), the sequence of operations necessary to produce the output.
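The sketch below illustrates this: the filter and map transformations only record lineage, and nothing runs until collect() is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10))

# Transformations: these only extend the DAG, no job is launched yet.
evens = numbers.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Action: triggers execution of the whole lineage in a single job.
print(doubled.collect())  # [0, 4, 8, 12, 16]
```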


This repository is made up of five projects.
Descriptions of these projects are included in their respective folders.

PySpark Projects:

To create this README, the Apache Spark site was used.