Pulsars are rapidly rotating neutron stars that emit beams of electromagnetic radiation. Detecting these celestial objects is challenging due to their rarity and the overwhelming noise in astronomical data. This project leverages advanced data processing and machine learning techniques to identify pulsars from large datasets efficiently.
This project aims to showcase the application of MLOps tools in a typical industrial project workflow. By integrating tools like MLflow, Apache Beam, Apache Airflow, and Prometheus, this project demonstrates how to streamline and automate the end-to-end process of data acquisition, model training, deployment, and monitoring. This approach ensures scalability, reproducibility, and efficient management of machine learning models in real-world scenarios.
To deploy the Pulsar Detection Project, you can either use Docker for containerization or run the application directly on your machine.

To run the application directly:

1. Clone the repository:

   ```bash
   git clone https://github.com/SivaSankarS365/Pulsar-Detection.git
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Start the FastAPI application:

   ```bash
   python deploy/api-app.py modeling/best_xgboost_model.pkl
   ```
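Once the service is up, you can send it a feature vector as JSON. A minimal client sketch follows; the endpoint path (`/predict`) and the feature names (the eight HTRU2-style profile and DM-SNR statistics) are assumptions here — check `deploy/api-app.py` for the actual request schema.

```python
import json
from urllib import request

# Hypothetical payload: eight pulsar-candidate statistics. The field names
# are illustrative assumptions, not the project's confirmed schema.
payload = {
    "mean_integrated_profile": 121.16,
    "std_integrated_profile": 48.37,
    "kurtosis_integrated_profile": 0.03,
    "skewness_integrated_profile": -0.11,
    "mean_dm_snr_curve": 3.17,
    "std_dm_snr_curve": 18.40,
    "kurtosis_dm_snr_curve": 7.98,
    "skewness_dm_snr_curve": 74.24,
}
body = json.dumps(payload).encode("utf-8")
req = request.Request(
    "http://localhost:5000/predict",  # assumed endpoint path
    data=body,
    headers={"Content-Type": "application/json"},
)
# Uncomment once the FastAPI app is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```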
To deploy with Docker:

1. Build the Docker image:

   ```bash
   docker build -t pulsar-fastapi-app .
   ```

2. Run the Docker container:

   ```bash
   docker run -d -p 5000:5000 -p 18000:18000 pulsar-fastapi-app
   ```
You can monitor the application using Prometheus and Node Exporter:

1. Start Node Exporter:

   ```bash
   ./node_exporter --web.listen-address=:9200 &
   ```

2. Start Prometheus with the specified configuration, replacing `<path-to-prometheus-binary>` with the path to your Prometheus binary:

   ```bash
   <path-to-prometheus-binary> --config.file=deploy/prometheus.yml
   ```
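The contents of `deploy/prometheus.yml` are not reproduced in this README. A minimal configuration consistent with the ports used above might look like the sketch below (assumed contents — adjust to match the repository's actual file):

```yaml
# Sketch of deploy/prometheus.yml: scrape Node Exporter on :9200 and the
# application's metrics endpoint on :18000 (both assumptions from the
# commands above).
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9200"]
  - job_name: pulsar-app
    static_configs:
      - targets: ["localhost:18000"]
```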
Pulsars are challenging to detect due to their sparse occurrence and the vast amount of noise in the data. This project addresses this challenge by applying robust data processing and machine learning techniques to analyze large datasets and accurately identify pulsars.
PulsarDetectionProject/
├── deploy/
│ └── api-app.py
├── download/
│ ├── fetch_data_dag.py
│ └── data/
├── modeling/
│ ├── train_model.py
│ └── models/
├── pulsar_processing.py
├── requirements.txt
└── README.md
- `api-app.py`: Contains the FastAPI application code for deploying the pulsar detection service.
- `fetch_data_dag.py`: An Airflow DAG that automates downloading the latest pulsar data. The data is stored in the `data/` subdirectory.

Usage:

1. Place `fetch_data_dag.py` in the Airflow DAGs directory.
2. Start the Airflow scheduler to fetch data periodically.
- `train_model.py`: Script to train machine learning models for pulsar detection, using MLflow for experiment tracking.
- `models/`: Directory to store the trained models.

Usage:

1. Run `train_model.py` to start training.
2. Use the MLflow UI to track and manage experiments.
This script processes the raw pulsar data using Apache Beam, performing data cleaning, feature engineering, and preparation for modeling.
Usage:

1. Ensure Apache Beam is installed.
2. Run the script:

   ```bash
   python pulsar_processing.py
   ```
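The exact transforms in `pulsar_processing.py` are not reproduced here, but a per-record cleaning step of the kind Beam applies with `beam.Map` might look like this sketch (the field names are illustrative, not the project's actual schema):

```python
from typing import Dict, Optional


def clean_record(record: Dict[str, str]) -> Optional[Dict[str, float]]:
    """Parse one raw CSV-like record into floats, dropping malformed rows.

    A function like this can be passed to beam.Map inside the pipeline,
    followed by beam.Filter(lambda r: r is not None) to discard bad rows.
    """
    try:
        return {key: float(value) for key, value in record.items()}
    except (TypeError, ValueError):
        return None


# Illustrative raw record with string-valued fields, as read from a CSV.
raw = {"mean_profile": "121.16", "std_profile": "48.37"}
print(clean_record(raw))
```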
MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment. In this project, MLflow is used to track the performance of various models and efficiently manage the trained models.
Apache Beam is a unified programming model for defining and executing data processing pipelines. It is utilized in this project for scalable and efficient processing of large pulsar datasets.
Apache Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring workflows. In this project, Airflow automates the data fetching process, ensuring that the models are always trained on the most recent data.
Prometheus is an open-source systems monitoring and alerting toolkit. In this project, Prometheus is used to monitor the application and infrastructure, providing real-time metrics and insights to ensure the system's health and performance.
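Instrumentation of the kind the application could expose (for example on port 18000, matching the Docker command above) can be sketched with the `prometheus_client` library; the metric names here are assumptions, not the project's actual metrics:

```python
from prometheus_client import Counter, Histogram, generate_latest

# Hypothetical metrics for the prediction service.
PREDICTIONS = Counter("pulsar_predictions", "Prediction requests served")
LATENCY = Histogram("pulsar_prediction_latency_seconds", "Prediction latency")

with LATENCY.time():
    PREDICTIONS.inc()  # would wrap the real predict() call

# start_http_server(18000)  # uncomment to serve /metrics for Prometheus
print(generate_latest().decode()[:200])
```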
This project combines cutting-edge machine learning, data processing, and workflow automation tools to address the challenge of detecting pulsars in noisy astronomical data. By leveraging Docker for deployment and tools like MLflow, Apache Beam, and Airflow, we ensure that the project is scalable, reproducible, and easy to maintain.