The objective of this project is to solve the Titanic - Machine Learning from Disaster problem from the Kaggle competition using classification algorithms. Specifically, we will use the PySpark library rather than Pandas.
Pandas vs. PySpark:
1. In-memory Computation
Pandas: Pandas operates primarily in-memory, meaning it loads the entire dataset into memory for processing. This can be limiting when working with very large datasets that don't fit in memory.
PySpark: PySpark also performs in-memory computation, but it can efficiently handle large datasets by distributing the data across a cluster's memory.
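As a rough illustration of the PySpark side, here is a minimal sketch that creates a local SparkSession and loads the Titanic training file into a Spark DataFrame. The train.csv path and the variable names spark and df are assumptions reused in the later sketches.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession; locally everything runs in one JVM,
# while on a cluster the same code spreads the data across executor memory.
spark = SparkSession.builder.appName("titanic").getOrCreate()

# Load the Titanic training data (path assumed to be ./train.csv)
df = spark.read.csv("train.csv", header=True, inferSchema=True)
df.printSchema()
```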
2. Distributed Processing using Parallelize
Pandas: Pandas has no native support for distributed processing or parallelization; it is designed for single-machine data analysis.
PySpark: PySpark is designed for distributed processing and can parallelize computations across a cluster of machines, making it suitable for big data processing.
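A minimal sketch of the parallelize API, assuming the spark session created above; the collection size and partition count are arbitrary placeholders.

```python
# Distribute a local Python collection across the cluster as an RDD
# split into 8 partitions (partition count chosen arbitrarily here)
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# map/sum run per partition in parallel on the executors
total = rdd.map(lambda x: x * 2).sum()
print(total)
```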
3. Cluster Managers
Pandas: Pandas does not integrate with cluster managers such as Spark Standalone, YARN, or Mesos.
PySpark: PySpark works with a range of cluster managers, including Spark's built-in standalone manager, YARN, and Mesos, allowing it to use cluster resources efficiently.
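The cluster manager is selected through the master URL when the session is built. The sketch below only shows the shape of the call; the YARN and standalone URLs are placeholders that depend on the actual deployment.

```python
from pyspark.sql import SparkSession

# "local[*]" keeps everything on one machine (enough for this project);
# swapping it for "yarn" or "spark://<host>:7077" hands resource management
# to YARN or a Spark standalone cluster without changing the rest of the code.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("titanic")
    .getOrCreate()
)
```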
4. Fault-Tolerant
Pandas: Pandas has no built-in fault-tolerance features.
PySpark: PySpark is designed for fault tolerance; it can recover from node failures in a cluster and continue processing.
5. Immutable
Pandas: Pandas DataFrames are mutable, meaning you can modify them in place.
PySpark: PySpark DataFrames are immutable; every transformation produces a new DataFrame. This immutability simplifies parallel processing and fault tolerance.
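A small sketch of the immutability point, assuming the Titanic DataFrame df loaded earlier (Age is a column in the Kaggle file): withColumn returns a new DataFrame and leaves the original untouched.

```python
from pyspark.sql import functions as F

# withColumn does not modify df in place; it returns a new DataFrame
df_with_flag = df.withColumn("IsChild", F.col("Age") < 18)

print("IsChild" in df.columns)            # False - original unchanged
print("IsChild" in df_with_flag.columns)  # True  - new DataFrame has it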
6. Lazy-evaluation
Pandas: Pandas does not support lazy evaluation.
PySpark: PySpark uses lazy evaluation: transformations on DataFrames are not executed immediately but are deferred until an action is performed, which lets Spark optimize the whole query before running it.
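A sketch of lazy evaluation on the same df (Survived, Name, Pclass, and Fare are columns in the Kaggle file): the transformations only build a plan, and nothing is executed until the action at the end.

```python
# filter/select are transformations: they only extend the logical plan
survivors = df.filter(df.Survived == 1).select("Name", "Pclass", "Fare")

# count() is an action: only now does Spark read the data and run the
# optimised plan end to end
print(survivors.count())
```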
7. Cache & Persistence
Pandas: Pandas provides no built-in mechanism for caching or persisting data.
PySpark: PySpark lets you cache or persist intermediate DataFrames in memory for faster access during iterative computations, improving performance.
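A sketch of caching, again assuming the df from above; dropping rows with missing Age or Embarked values is just a stand-in for any cleaning step whose result will be reused.

```python
# Mark the cleaned DataFrame for in-memory caching
cleaned = df.dropna(subset=["Age", "Embarked"]).cache()

# The first action materialises the cache; later actions reuse it
cleaned.count()
cleaned.groupBy("Pclass").count().show()
```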
8. Inbuilt Optimization with DataFrames
Pandas: Pandas has no built-in optimizer for distributed computing.
PySpark: PySpark DataFrames are designed for optimized distributed execution; the Catalyst query optimizer and the Tungsten execution engine improve query performance.
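One way to see Catalyst at work on the df from above is explain(), which prints the plans Spark will execute for a query; the particular filter here is only an example.

```python
# explain(True) shows the parsed, analysed, optimised, and physical plans
# produced by Catalyst for this query
df.filter(df.Sex == "female").select("Survived", "Sex").explain(True)
```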
9. Supports ANSI SQL
Pandas: Pandas does not directly support ANSI SQL, although SQL-like syntax is available through the pandasql library.
PySpark: PySpark supports ANSI SQL through its Spark SQL module, allowing you to run SQL queries directly on DataFrames.
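A sketch of the Spark SQL route, assuming the df from above; the view name titanic is arbitrary.

```python
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("titanic")

# Standard SQL over the same data; the result is returned as a new DataFrame
spark.sql("""
    SELECT Pclass, AVG(Survived) AS survival_rate
    FROM titanic
    GROUP BY Pclass
    ORDER BY Pclass
""").show()
```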