This guide provides instructions on how to set up an Apache Spark environment with JupyterLab using Docker and Docker Compose, including integration with a PostgreSQL database.
- Docker (Docker Desktop for Windows)
- Git (if cloning the repository)
The setup is configured to automatically start the Spark master and a single worker node when the container starts, along with a PostgreSQL database service.
- Clone the repository (skip this step if you already have the files):

  ```bash
  git clone https://github.com/pablo-git8/spark-project.git
  cd spark-project
  ```

- Build the Docker image: navigate to the directory containing your Dockerfile and run:

  ```bash
  docker-compose build
  ```

- Run the Docker containers: start the services defined in `docker-compose.yml`:

  ```bash
  docker-compose up -d
  ```
Because the master and worker start automatically, the Spark master can manage jobs that are executed on the worker node as soon as the container is up.
After the setup, validate the services are running correctly:
- Verify the container: confirm your container was created and is running (status 'Up') by executing:

  ```bash
  docker ps
  ```

- Access JupyterLab: open your web browser and go to http://localhost:8888/lab. You should see the JupyterLab interface.

- Access the Spark Master Web UI: open your web browser and go to http://localhost:8080. You should see the Apache Spark master's web UI.

- Access the PostgreSQL database: connect using the credentials specified in the `.env` file, via an SQL client or the notebooks in JupyterLab. The database should be reachable at `127.0.0.1:5432` (or `0.0.0.0:5432`); if not, try `host.docker.internal:5432`.
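The endpoint checks above can also be scripted. Below is a minimal sketch, using only the Python standard library, that probes the JupyterLab and Spark UI ports listed in this guide (the helper name `service_is_up` is ours, not part of the project):

```python
from urllib.request import urlopen
from urllib.error import URLError


def service_is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a successful HTTP status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, etc.
        return False


# Ports as exposed by this setup
print("JupyterLab:", service_is_up("http://localhost:8888/lab"))
print("Spark UI:  ", service_is_up("http://localhost:8080"))
```

If either check prints `False`, fall back to the container logs described below.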
If you are unable to access these interfaces, consult the container logs for troubleshooting:
```bash
docker logs spark-container
```
- Your local `./data` directory is mounted to `/data` inside the container for persistent data storage.
- Your local `./notebooks` directory is mounted to `/opt/bitnami/spark/work/notebooks` inside the container, where you can place and manage Jupyter notebooks.
- A PostgreSQL database is available as a service, accessible using the credentials defined in the `.env` file.
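Because of these mounts, the same notebook code can resolve the data directory whether it runs on the host or inside the container. A small sketch under that assumption (the `DATA_DIR` name is illustrative):

```python
from pathlib import Path

# ./data on the host is mounted at /data inside the container,
# so prefer the container path when it exists.
CONTAINER_DATA = Path("/data")
DATA_DIR = CONTAINER_DATA if CONTAINER_DATA.exists() else Path("./data")

print(f"Using data directory: {DATA_DIR}")
```

Files written under `DATA_DIR` persist on the host even after the container is removed.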
Example Python code snippet in JupyterLab to connect to PostgreSQL:
```python
import psycopg2

# Establish a connection using the credentials from your .env file
conn = psycopg2.connect(
    dbname="your_db_name",
    user="your_user",
    password="your_password",
    host="postgres-container",
)
cursor = conn.cursor()

# Run a simple query to confirm the connection works
cursor.execute("SELECT version();")
print(cursor.fetchone())

# Remember to close the connection when done
cursor.close()
conn.close()
```
Make sure to create a `.env` file in your repository path with the required PostgreSQL credentials; see the `.env-ex` mock file for an example. Note: depending on how you load your credentials, you may need to copy the `.env` file into your Jupyter notebook path.
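One dependency-free way to load those credentials in a notebook is to parse the `.env` file yourself. A minimal sketch, assuming simple `KEY=VALUE` lines (the key names in the commented example are hypothetical, not the actual keys from `.env-ex`):

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


# Example usage (key names are hypothetical):
# creds = load_env()
# conn = psycopg2.connect(
#     dbname=creds["POSTGRES_DB"],
#     user=creds["POSTGRES_USER"],
#     password=creds["POSTGRES_PASSWORD"],
#     host="postgres-container",
# )
```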
To stop the running services and clean up containers, networks, and volumes, use the following commands:
```bash
# Stop and remove containers and their networks
docker-compose down

# OPTIONAL: also remove the volumes and images
docker-compose down -v --rmi all

# OPTIONAL: remove the built Docker image
docker rmi spark-project-spark-jupyter
```
Warning: The last two commands remove the project's volumes and images, including any persisted data. They are optional and should be used with caution.