The objective of this project is to solve the Titanic - Machine Learning from Disaster problem from the Kaggle competition using classification algorithms. Specifically, we will use the PySpark library rather than Pandas.
Pandas vs. PySpark:
1. In-memory Computation
Pandas: Pandas operates primarily in-memory, meaning it loads the entire dataset into memory for processing. This can be limiting when working with very large datasets that don't fit in memory.
PySpark: PySpark also performs in-memory computation, but it can efficiently handle large datasets by distributing the data across a cluster's memory.
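As a rough illustration of the PySpark side, here is a minimal sketch that creates a local SparkSession and loads the Titanic training file into a Spark DataFrame. The train.csv path and the variable names spark and df are assumptions reused in the later sketches.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession; locally everything runs in one JVM,
# while on a cluster the same code spreads the data across executor memory.
spark = SparkSession.builder.appName("titanic").getOrCreate()

# Load the Titanic training data (path assumed to be ./train.csv)
df = spark.read.csv("train.csv", header=True, inferSchema=True)
df.printSchema()
```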
2. Distributed Processing using Parallelize
Pandas: Pandas has no native support for distributed processing or parallelization; it is designed for single-machine data analysis.
PySpark: PySpark is designed for distributed processing and can parallelize computations across a cluster of machines, making it suitable for big data processing.
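A minimal sketch of the parallelize API, assuming the spark session created above; the collection size and partition count are arbitrary placeholders.

```python
# Distribute a local Python collection across the cluster as an RDD
# split into 8 partitions (partition count chosen arbitrarily here)
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# map/sum run per partition in parallel on the executors
total = rdd.map(lambda x: x * 2).sum()
print(total)
```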
3. Cluster Managers
Pandas: Pandas does not integrate with cluster managers such as Spark Standalone, YARN, or Mesos.
PySpark: PySpark works with a range of cluster managers, including Spark's built-in standalone manager, YARN, and Mesos, allowing it to use cluster resources efficiently.
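The cluster manager is selected through the master URL when the session is built. The sketch below only shows the shape of the call; the YARN and standalone URLs are placeholders that depend on the actual deployment.

```python
from pyspark.sql import SparkSession

# "local[*]" keeps everything on one machine (enough for this project);
# swapping it for "yarn" or "spark://<host>:7077" hands resource management
# to YARN or a Spark standalone cluster without changing the rest of the code.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("titanic")
    .getOrCreate()
)
```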
4. Fault-Tolerant
Pandas: Pandas has no built-in fault-tolerance features.
PySpark: PySpark is designed for fault tolerance; it can recover from node failures in a cluster and continue processing.
5. Immutable
Pandas: Pandas DataFrames are mutable, meaning you can modify them in place.
PySpark: PySpark DataFrames are immutable; every transformation produces a new DataFrame. This immutability simplifies parallel processing and fault tolerance.
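A small sketch of the immutability point, assuming the Titanic DataFrame df loaded earlier (Age is a column in the Kaggle file): withColumn returns a new DataFrame and leaves the original untouched.

```python
from pyspark.sql import functions as F

# withColumn does not modify df in place; it returns a new DataFrame
df_with_flag = df.withColumn("IsChild", F.col("Age") < 18)

print("IsChild" in df.columns)            # False - original unchanged
print("IsChild" in df_with_flag.columns)  # True  - new DataFrame has it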
6. Lazy-evaluation
Pandas: Pandas does not support lazy evaluation.
PySpark: PySpark uses lazy evaluation: transformations on DataFrames are not executed immediately but are deferred until an action is performed, which lets Spark optimize the whole query before running it.
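A sketch of lazy evaluation on the same df (Survived, Name, Pclass, and Fare are columns in the Kaggle file): the transformations only build a plan, and nothing is executed until the action at the end.

```python
# filter/select are transformations: they only extend the logical plan
survivors = df.filter(df.Survived == 1).select("Name", "Pclass", "Fare")

# count() is an action: only now does Spark read the data and run the
# optimised plan end to end
print(survivors.count())
```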
7. Cache & Persistence
Pandas: Pandas provides no built-in mechanism for caching or persisting data.
PySpark: PySpark lets you cache or persist intermediate DataFrames in memory for faster access during iterative computations, improving performance.
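A sketch of caching, again assuming the df from above; dropping rows with missing Age or Embarked values is just a stand-in for any cleaning step whose result will be reused.

```python
# Mark the cleaned DataFrame for in-memory caching
cleaned = df.dropna(subset=["Age", "Embarked"]).cache()

# The first action materialises the cache; later actions reuse it
cleaned.count()
cleaned.groupBy("Pclass").count().show()
```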
8. Inbuilt Optimization with DataFrames
Pandas: Pandas has no built-in optimizer for distributed computing.
PySpark: PySpark DataFrames are designed for optimized distributed execution; the Catalyst query optimizer and the Tungsten execution engine improve query performance.
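One way to see Catalyst at work on the df from above is explain(), which prints the plans Spark will execute for a query; the particular filter here is only an example.

```python
# explain(True) shows the parsed, analysed, optimised, and physical plans
# produced by Catalyst for this query
df.filter(df.Sex == "female").select("Survived", "Sex").explain(True)
```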
9. Supports ANSI SQL
Pandas: Pandas does not directly support ANSI SQL, although SQL-like syntax is available through the pandasql library.
PySpark: PySpark supports ANSI SQL through its Spark SQL module, allowing you to run SQL queries directly on DataFrames.
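A sketch of the Spark SQL route, assuming the df from above; the view name titanic is arbitrary.

```python
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("titanic")

# Standard SQL over the same data; the result is returned as a new DataFrame
spark.sql("""
    SELECT Pclass, AVG(Survived) AS survival_rate
    FROM titanic
    GROUP BY Pclass
    ORDER BY Pclass
""").show()
```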