Kaggle-Titanic-Challenge-Pyspark

Solving the Kaggle Titanic challenge with the PySpark library


Kaggle Titanic Challenge with PySpark


Team 4:

| Student | Student ID |
|---|---|
| Mauricio Juárez Sánchez | A01660336 |
| Alfredo Jeong Hyun Park | A01658259 |
| Fernando Alfonso Arana Salas | A01272933 |
| Miguel Ángel Bustamante Pérez | A01781583 |

Kaggle Challenge – Titanic Classification

The objective of this project is to solve the Titanic - Machine Learning from Disaster problem from the Kaggle competition using classification algorithms. Specifically, we use the PySpark library rather than Pandas; the point-by-point comparison below motivates that choice.

Pandas vs PySpark:

1. In-memory Computation

| Pandas | PySpark |
|---|---|
| Pandas operates primarily in-memory: it loads the entire dataset into the memory of a single machine, which becomes limiting when a dataset does not fit in RAM. | PySpark also performs in-memory computation, but it can handle large datasets efficiently by distributing the data across a cluster's memory (see the sketch below). |
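
As a minimal sketch of the difference, assuming the competition's `train.csv` sits in the working directory (the file name and the partition check are illustrative), the same load looks like this in both libraries:

```python
import pandas as pd
from pyspark.sql import SparkSession

# Pandas: the whole CSV is pulled into the memory of this single process.
pdf = pd.read_csv("train.csv")

# PySpark: the CSV is split into partitions that executors can hold
# across a cluster's memory rather than on one machine.
spark = SparkSession.builder.appName("titanic").getOrCreate()
sdf = spark.read.csv("train.csv", header=True, inferSchema=True)
print(sdf.rdd.getNumPartitions())  # how many distributed partitions were created
```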

2. Distributed Processing using Parallelize

| Pandas | PySpark |
|---|---|
| Pandas has no native support for distributed processing or parallelization; it is designed for single-machine data analysis. | PySpark is designed for distributed processing and can parallelize computations across a cluster of machines, making it suitable for big data processing (illustrated below). |
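
A minimal sketch of `parallelize`, assuming a local session (the data and the partition count are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelize-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a plain Python list across the cluster as an RDD in 4 partitions.
rdd = sc.parallelize(range(10), numSlices=4)

# map() runs in parallel on each partition; collect() gathers the results back.
print(rdd.map(lambda x: x * x).collect())
```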

3. Cluster Managers

| Pandas | PySpark |
|---|---|
| Pandas does not integrate with cluster managers such as Spark Standalone, YARN, or Mesos. | PySpark works with several cluster managers, such as Spark's built-in standalone manager, YARN, and Mesos, allowing it to use cluster resources efficiently (see the sketch below). |
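
The cluster manager is selected through the master URL when the session is built. A hedged sketch (the host and port are placeholders, not a real cluster):

```python
from pyspark.sql import SparkSession

# The master URL selects the cluster manager. Common values include:
#   "local[4]"          - run locally with 4 worker threads (no cluster)
#   "spark://host:7077" - Spark's built-in standalone cluster manager
#   "yarn"              - Hadoop YARN (cluster details come from Hadoop config)
spark = (
    SparkSession.builder
    .appName("titanic")
    .master("local[4]")  # swap in a real cluster URL when one is available
    .getOrCreate()
)
```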

4. Fault Tolerance

| Pandas | PySpark |
|---|---|
| Pandas has no built-in fault-tolerance features. | PySpark is designed for fault tolerance: it can recover from node failures in a cluster and continue processing, because every RDD carries the lineage of transformations needed to rebuild lost partitions (shown below). |
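
The lineage that makes this recovery possible can be inspected directly; a small sketch with an arbitrary pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

# The lineage is the recipe of transformations; if a node dies, Spark replays
# it to rebuild only the lost partitions instead of failing the whole job.
print(rdd.toDebugString().decode("utf-8"))
```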

5. Immutability

| Pandas | PySpark |
|---|---|
| Pandas DataFrames are mutable, meaning you can modify them in place. | PySpark DataFrames are immutable: any transformation on a DataFrame creates a new DataFrame. This immutability simplifies parallel processing and fault tolerance (see below). |
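
A quick sketch of that behaviour with a toy DataFrame (the column names are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("immutability-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# withColumn never modifies df; it returns a brand-new DataFrame.
df2 = df.withColumn("id_doubled", col("id") * 2)

print(df.columns)   # ['id', 'label']                 -- the original is untouched
print(df2.columns)  # ['id', 'label', 'id_doubled']
```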

6. Lazy Evaluation

| Pandas | PySpark |
|---|---|
| Pandas does not support lazy evaluation. | PySpark supports lazy evaluation: transformations on DataFrames are not executed immediately but are deferred until an action is performed, which optimizes query execution (sketched below). |
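
A minimal sketch (the numbers are arbitrary):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
df = spark.range(1_000_000)  # a DataFrame with a single 'id' column

# Transformations only build up a query plan; nothing executes here.
halved = df.filter(col("id") % 2 == 0).withColumn("half", col("id") / 2)

# An action such as count() finally triggers execution, after Spark has
# had the chance to optimize the complete plan.
print(halved.count())
```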

7. Cache & Persistence

| Pandas | PySpark |
|---|---|
| Pandas provides no built-in mechanism for caching or persisting data. | PySpark lets you cache intermediate DataFrames in memory for faster access during iterative computations, improving performance (see the sketch below). |
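
A small sketch of `cache()` and `persist()` (the DataFrame is a throwaway example):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.range(1_000_000)

# cache() keeps the DataFrame in memory after it is first computed...
df.cache()
df.count()   # first action: computes the data and fills the cache
df.count()   # second action: served from the cache, not recomputed

# ...while persist() offers finer-grained control over the storage level.
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)
```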

8. Inbuilt Optimization with DataFrames

| Pandas | PySpark |
|---|---|
| Pandas provides no built-in optimization for distributed computing. | PySpark's DataFrames are designed for optimized distributed computing: the Catalyst query optimizer and the Tungsten execution engine help improve query performance (illustrated below). |
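
Catalyst's work can be observed with `explain()`; a sketch with an arbitrary query (note the two separate filters, which the optimizer is free to combine):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.range(1_000_000)

query = df.filter(col("id") > 10).select("id").filter(col("id") < 100)

# explain(True) prints the parsed, analyzed, optimized, and physical plans
# that Catalyst produces for this query.
query.explain(True)
```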

9. Supports ANSI SQL

| Pandas | PySpark |
|---|---|
| Pandas does not directly support ANSI SQL, although the pandasql library offers SQL-like syntax on top of it. | PySpark has built-in support for ANSI SQL through its Spark SQL module, allowing you to run SQL queries on DataFrames (see the sketch below). |
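
A closing sketch of Spark SQL against the Titanic data, assuming `train.csv` with its standard Kaggle columns (`Pclass`, `Survived`) is available locally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

titanic = spark.read.csv("train.csv", header=True, inferSchema=True)
titanic.createOrReplaceTempView("titanic")

# Plain SQL runs directly on the DataFrame through Spark SQL.
spark.sql("""
    SELECT Pclass, AVG(Survived) AS survival_rate
    FROM titanic
    GROUP BY Pclass
    ORDER BY Pclass
""").show()
```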