pyspark-hdfs-docker

Running PySpark jobs using Docker

Apache Spark is an open-source, distributed computing system that offers a comprehensive framework for handling big data processing and analytics. Spark provides high-level APIs in Java, Scala, Python, and R, making it accessible to a wide range of developers and data scientists. It is designed to perform both batch processing and real-time data processing efficiently. Spark achieves high performance for both streaming and batch data using a DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine. At its core, Spark facilitates the development of complex data pipelines and iterative algorithms, while its in-memory computing capabilities allow it to process large datasets quickly. Spark's ecosystem includes several libraries for diverse tasks like SQL and DataFrames, machine learning (MLlib), graph processing (GraphX), and stream processing (Spark Streaming), making it a versatile tool for big data analytics and machine learning projects.

PySpark is the Python API for Apache Spark, allowing Python programmers to leverage the capabilities of Spark using Python's rich ecosystem. It provides a way to interface with Spark's distributed computing capabilities, enabling data scientists and analysts to execute data analysis and machine learning tasks at scale. PySpark offers Python wrappers for Spark's distributed data collections (RDDs and DataFrames) and its powerful APIs for streaming, SQL, machine learning, and graph processing. This seamless integration with Python enables users to combine the simplicity and flexibility of Python with the power of Apache Spark to analyze big data and derive insights. PySpark has become particularly popular in the data science community, where Python is widely used, as it allows for easy development of scalable and complex data pipelines and algorithms, all within the Python environment.
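
As a quick illustration of the DataFrame API described above, a minimal, hypothetical snippet (not part of this repository) could look like:
from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a small DataFrame and run a simple aggregation
df = spark.createDataFrame([("Alice", 34), ("Bob", 27)], ["Name", "Age"])
df.groupBy().avg("Age").show()

spark.stop()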

Instructions

  • Build Docker image:
docker build -t sparky:latest .
  • Create a Docker network:
docker network create sparky
  • Run Spark master container:
docker rm spark-master   # remove any previous container with this name
docker run --name spark-master \
    -e SPARK_MODE=master \
    -e PYSPARK_PYTHON=python3.11 \
    -e PYSPARK_DRIVER_PYTHON=python3.11 \
    -p 8080:8080 -p 8081:8081 -p 7077:7077 \
    --network sparky \
    sparky:latest
  • Run Spark worker container:
docker rm spark-worker-1   # remove any previous container with this name
docker run --name spark-worker-1 \
    -e SPARK_MODE=worker \
    -e PYSPARK_PYTHON=python3.11 \
    -e PYSPARK_DRIVER_PYTHON=python3.11 \
    -e SPARK_MASTER_URL=spark://spark-master:7077 \
    --network sparky \
    sparky:latest
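
Once the master and worker are up, a driver can target the standalone cluster through the same master URL configured above. A minimal, hypothetical check (run from a container attached to the sparky network, where the spark-master hostname resolves) could look like:
from pyspark.sql import SparkSession

# Point the driver at the standalone master started above
spark = (
    SparkSession.builder
    .appName("cluster-check")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Confirm which master this session is attached to
print(spark.sparkContext.master)
spark.stop()
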
  • Create a Python 3.11 virtual environment and install the dependencies:
virtualenv -p python3.11 .env
source .env/bin/activate
pip install -r requirements.txt
  • Calculate pi using PySpark:
python3.11 pi.py
Pi is roughly 3.134520
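
A minimal sketch of what a Monte Carlo estimate like pi.py might look like (the actual script in this repository may differ; the sample count is arbitrary):
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EstimatePi").getOrCreate()

NUM_SAMPLES = 1_000_000

def inside(_):
    # Sample a random point in the unit square and keep it if it falls inside the quarter circle
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = spark.sparkContext.parallelize(range(NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

spark.stop()
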
  • Calculate the average age of people older than 25:
python3.11 filter.py
Average Age of People Older Than 25:
+------------------+
|          avg(Age)|
+------------------+
|62.036144578313255|
+------------------+
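
A rough sketch of the kind of logic filter.py might use to produce output like this, with toy data standing in for the repository's dataset (column and variable names are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("FilterAges").getOrCreate()

# Hypothetical input: a DataFrame with Name and Age columns
people = spark.createDataFrame([("Alice", 62), ("Bob", 22), ("Carol", 71)], ["Name", "Age"])

# Keep only people older than 25 and average their ages
print("Average Age of People Older Than 25:")
people.filter(col("Age") > 25).agg(avg("Age")).show()

spark.stop()
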
  • Generate a dataset for Elasticsearch:
python3.11 elastic.py
Elasticsearch document 1: {"Date":"2024-02-06T22:07:43.129-03:00","Store":"LJbQxEXVhR","Product":"HrSQpQofYw","Method":"CASH","Total":37,"Currency":"USD"}
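
A hedged sketch of how such JSON documents could be generated with PySpark; the field names follow the sample output above, while the value ranges and generation logic are assumptions:
import json
import random
import string
from datetime import datetime, timezone
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ElasticDataset").getOrCreate()

def random_string(n=10):
    # Random identifier, similar in shape to the Store/Product values in the sample output
    return "".join(random.choices(string.ascii_letters, k=n))

def make_doc(_):
    return {
        "Date": datetime.now(timezone.utc).isoformat(),
        "Store": random_string(),
        "Product": random_string(),
        "Method": random.choice(["CASH", "CARD"]),
        "Total": random.randint(1, 100),
        "Currency": "USD",
    }

# Build the documents in parallel, then print each one as a JSON line
docs = spark.sparkContext.parallelize(range(10)).map(make_doc).collect()
for i, doc in enumerate(docs, start=1):
    print(f"Elasticsearch document {i}: {json.dumps(doc)}")

spark.stop()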