Setting up a data lake for financial data visualization using Apache Kafka, Apache Spark, Apache Beam, Apache Druid, and Streamlit
This project establishes a data lake for visualizing financial data using Apache Spark, Apache Beam, Apache Druid, and Streamlit. Two data sources, Yahoo Finance and the New York Times API, are ingested through Apache Beam and stored in Parquet format. Using Spark ML, multiple sentiment analysis models are trained on a Kaggle dataset and the best model is selected. The data is then analyzed with Spark SQL to predict sentiment scores for NYT articles, and the results are stored in Apache Druid. Finally, Streamlit provides an interactive interface for exploring the results, including the data table and daily sentiment statistics.
Before getting started, ensure your environment meets the following requirements:
- Operating System: Any OS that can run Docker and Docker Compose (Linux, macOS, or Windows).
- RAM: Minimum of 13 GB.
- Docker: Installed and verified.
- Docker Compose: Installed and verified.
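To confirm that Docker and Docker Compose are installed and working, you can run the standard version checks before starting:

```shell
# Check that the Docker and Docker Compose binaries are installed
docker --version
docker-compose --version

# Verify the Docker daemon is actually running
docker info > /dev/null && echo "Docker daemon is up"
```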
System Architecture

(architecture diagram)
We use Apache Beam to build the data processing pipeline. Two data schemas are defined with the PyArrow library: one for New York Times news articles and one for financial information from Yahoo Finance. The pipeline retrieves article data via HTTP requests and financial data from Yahoo Finance, processing both sources in parallel; the schemas fix the data structure so it stays consistent throughout the Beam pipeline.
The results of the Beam pipeline execution are written to Parquet files with explicit schemas for both New York Times and Yahoo Finance data.
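A minimal sketch of the Parquet write step, assuming a PyArrow schema and dict-shaped records (the schema fields and output path are hypothetical):

```python
import apache_beam as beam
import pyarrow as pa
from apache_beam.io.parquetio import WriteToParquet

# Hypothetical schema; the real project defines one per source
nyt_schema = pa.schema([
    ("pub_date", pa.string()),
    ("abstract", pa.string()),
])

with beam.Pipeline() as p:
    (
        p
        # In the real pipeline these records come from HTTP requests to the NYT API
        | "CreateArticles" >> beam.Create([
            {"pub_date": "2023-01-02", "abstract": "Markets rallied after ..."},
        ])
        # Write the PCollection to Parquet with an explicit schema
        | "WriteNYT" >> WriteToParquet(
            file_path_prefix="nyt_out",   # hypothetical output prefix
            schema=nyt_schema,
            file_name_suffix=".parquet",
        )
    )
```

The same pattern is repeated with the Yahoo Finance schema for the financial records.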
Multiple sentiment analysis models are trained using Spark ML in a Zeppelin notebook, and the best model is saved to HDFS.
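The training step might look roughly like the following PySpark sketch. The dataset path, column names, and the choice of logistic regression are assumptions; the project actually compares several models and keeps the best one:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("sentiment-training").getOrCreate()

# Hypothetical Kaggle dataset with "text" and "sentiment" columns
df = spark.read.csv("hdfs:///data/kaggle_sentiment.csv", header=True)

# Text pipeline: index labels, tokenize, TF-IDF features, then a classifier
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="sentiment", outputCol="label"),
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=20),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Evaluate on the held-out split, then persist the winning model to HDFS
acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(model.transform(test))
print(f"accuracy: {acc:.3f}")
model.write().overwrite().save("hdfs:///models/best_sentiment_model")
```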
The saved model is then used to predict sentiment labels from a text column of the New York Times DataFrame. The results are filtered and displayed.
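The prediction step could be sketched as follows (the model path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("nyt-sentiment").getOrCreate()

# Load the model saved during training (path is an assumption)
model = PipelineModel.load("hdfs:///models/best_sentiment_model")

# NYT articles previously written as Parquet by the Beam pipeline
nyt = spark.read.parquet("hdfs:///data/nyt").withColumnRenamed("abstract", "text")

# Add a "prediction" column, keep only rows with non-empty text, and display
predictions = model.transform(nyt.filter("text IS NOT NULL"))
predictions.select("pub_date", "text", "prediction").show(truncate=False)
```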
The New York Times sentiment data is then analyzed with Spark SQL: abstracts are truncated, the total number of abstracts and the average sentiment per day are computed, and the resulting daily summary is displayed sorted by date.
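The daily summary could be computed along these lines with Spark SQL (the table, path, and column names are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-sentiment").getOrCreate()

# Hypothetical table of scored articles with pub_date, abstract, prediction columns
spark.read.parquet("hdfs:///data/nyt_scored").createOrReplaceTempView("nyt_sentiment")

# Truncated abstracts next to their predicted sentiment
spark.sql("""
    SELECT substr(abstract, 1, 100) AS short_abstract, prediction
    FROM nyt_sentiment
""").show(truncate=False)

# Per-day counts and average sentiment, sorted by date
daily_stats = spark.sql("""
    SELECT to_date(pub_date)         AS day,
           COUNT(*)                  AS num_abstracts,
           ROUND(AVG(prediction), 3) AS avg_sentiment
    FROM nyt_sentiment
    GROUP BY to_date(pub_date)
    ORDER BY day
""")
daily_stats.show()
```

A table of this shape is what gets loaded into Apache Druid and surfaced in the Streamlit interface.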
```shell
# Clone the repository from GitHub
git clone https://github.com/ElmansouriAMINE/mise-en-place-d-un-data-lake-master.git

# Move into the project directory
cd mise-en-place-d-un-data-lake-master

# Start the services in the background with Docker Compose
docker-compose up -d
```
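Once the stack is up, you can check container status and follow the logs (the service name below is an example; use the names defined in the project's docker-compose.yml):

```shell
# List the services defined in docker-compose.yml and their status
docker-compose ps

# Tail the logs of a single service (replace "druid" with an actual service name)
docker-compose logs -f druid
```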