Project-2--Disaster-Response-Pipeline-Udacity

This project creates a machine learning pipeline to categorize disaster events from a dataset containing real messages. It includes a web app where an emergency worker can input a new message and get classification results in several categories.

[Screenshot: enter_message]

Goal

The project goal is to create a machine learning pipeline to classify disaster events from a dataset provided by Figure Eight containing real messages. The final outcome is a web app where an emergency worker can enter a new message and get classification results in different categories.

Installation

  • Python3
  • Machine Learning Libraries: NumPy, Pandas, Scikit-Learn
  • Natural Language Processing Library: NLTK
  • SQLite Database Library: SQLAlchemy
  • Model Loading and Saving Library: Pickle
  • Web App and Data Visualization: Flask, Plotly

Instructions for execution:

  1. Run the following commands in the project's root directory to set up your database and model.

    • To run the ETL pipeline that cleans the data and stores it in the database:
      python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/disaster_response.db
    • To run the ML pipeline that trains the classifier and saves it:
      python models/train_classifier.py data/disaster_response.db models/classifier.pkl
  2. Run the following command in the app's directory to run your web app:
      python run.py

  3. Go to http://0.0.0.0:3001/ to see the web app.

File Description

The notebooks folder contains two Jupyter notebooks that show how the pipeline scripts are built step by step:

  • ETL Pipeline Preparation: Loads the datasets, merges them, cleans the data and stores the result in a SQLite database.
  • ML Pipeline Preparation: Loads the dataset from the SQLite database, splits the data into a train and a test set, builds a text preprocessing and ML pipeline, trains and tunes models using GridSearch (SVM, Random Forest), outputs results on the test set and exports the final model as a pickle file.
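
As a rough illustration of the ML pipeline steps above, here is a minimal sketch of a text pipeline tuned with a grid search. It is not the code from the notebook: the toy messages, the two category names, the step names (`tfidf`, `clf`) and the parameter grid are all assumptions made for the example, and scikit-learn's default tokenizer stands in for the NLTK preprocessing. The train/test split is left out because the toy data is too small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data, invented for this sketch (not taken from the real dataset).
MESSAGES = [
    "we need clean drinking water",
    "my brother is injured and needs a doctor",
    "no water and people are sick, send medicine",
    "the road to the village is blocked",
    "please send bottled water",
    "we need medicine and a doctor",
    "water is contaminated, children are sick",
    "the bridge collapsed last night",
]
CATEGORY_NAMES = ["water", "medical_help"]  # hypothetical subset of categories
LABELS = np.array([
    [1, 0], [0, 1], [1, 1], [0, 0],
    [1, 0], [0, 1], [1, 1], [0, 0],
])


def build_search():
    """Build a TF-IDF + multi-output linear SVM pipeline wrapped in a grid search."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        # One LinearSVC is trained per category column.
        ("clf", MultiOutputClassifier(LinearSVC())),
    ])
    # Assumed grid: only the regularisation strength C is tuned here.
    param_grid = {"clf__estimator__C": [0.1, 1.0]}
    return GridSearchCV(pipeline, param_grid, cv=2)


search = build_search()
search.fit(MESSAGES, LABELS)

# One row per input message, one 0/1 column per category.
prediction = search.predict(["we urgently need water"])
print(search.best_params_)
print(dict(zip(CATEGORY_NAMES, prediction[0])))
```

`MultiOutputClassifier` is used because each message can belong to several categories at once, so the target is a matrix of 0/1 flags rather than a single label.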

Python scripts:

  • data/process_data.py - ETL pipeline
  • models/train_classifier.py - ML Pipeline
  • app/run.py - Flask Web App

Datasets:

  • messages.csv: Contains the id, the message and the genre, i.e. the channel (direct, social, ...) through which the message was sent.
  • categories.csv: Contains the id and the categories (related, offer, medical assistance, ...) the message belongs to.
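
The merge-and-clean step on these two files could look roughly like the sketch below. It assumes that the categories arrive as a single string such as `related-1;offer-0`, which is not stated above, and the table name `messages` is hypothetical. For brevity it writes through Python's built-in `sqlite3` module instead of SQLAlchemy, and it uses small in-memory frames rather than the real CSV files.

```python
import sqlite3

import pandas as pd


def merge_and_clean(messages, categories):
    """Merge both frames on id and expand the category string into 0/1 columns."""
    df = messages.merge(categories, on="id")
    # Assumed format: "name-flag" pairs separated by ";".
    expanded = df["categories"].str.split(";", expand=True)
    # Column names come from the text before the last "-" in the first row.
    expanded.columns = [value.rsplit("-", 1)[0] for value in expanded.iloc[0]]
    for column in expanded.columns:
        # The flag is the last character of each "name-flag" entry.
        expanded[column] = expanded[column].str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), expanded], axis=1)
    return df.drop_duplicates().reset_index(drop=True)


def save_to_sqlite(df, connection, table_name="messages"):
    """Write the cleaned frame to a SQLite table, replacing any existing one."""
    df.to_sql(table_name, connection, index=False, if_exists="replace")


# Toy stand-in data, invented for this sketch.
messages = pd.DataFrame({
    "id": [1, 2, 2],
    "message": ["We need water", "Storm damaged the road", "Storm damaged the road"],
    "genre": ["direct", "news", "news"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": [
        "related-1;offer-0;medical_help-1",
        "related-1;offer-0;medical_help-0",
    ],
})

clean = merge_and_clean(messages, categories)
connection = sqlite3.connect(":memory:")
save_to_sqlite(clean, connection)
stored = pd.read_sql("SELECT * FROM messages", connection)
print(stored)
```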

Results

The final output of the project is an interactive web app that takes a message from the user as an input and then classifies it into the respective categories.
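
To illustrate how a saved model might be used to classify a new message, here is a minimal sketch that pickles a small model, loads it again and maps the prediction to category names. The helper `classify_message`, the toy training data and the two category names are assumptions for the example; the actual app code in app/run.py may differ.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

CATEGORY_NAMES = ["water", "medical_help"]  # hypothetical subset of categories


def train_toy_model():
    """Train a tiny stand-in model on invented messages."""
    messages = [
        "we need drinking water",
        "send a doctor, people are injured",
        "no water and we need medicine",
        "the road is blocked",
    ]
    labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultiOutputClassifier(LinearSVC())),
    ])
    model.fit(messages, labels)
    return model


def classify_message(model, message, category_names):
    """Return a {category: 0 or 1} mapping for a single message."""
    flags = model.predict([message])[0]
    return {name: int(flag) for name, flag in zip(category_names, flags)}


# Save the model to a pickle file and load it back, as a web app would at start-up.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.pkl")
    with open(model_path, "wb") as handle:
        pickle.dump(train_toy_model(), handle)
    with open(model_path, "rb") as handle:
        loaded_model = pickle.load(handle)

result = classify_message(loaded_model, "we need water", CATEGORY_NAMES)
print(result)
```

Only load pickle files from a source you trust, since unpickling can execute arbitrary code.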

[Screenshots: test_message1, test_message2]

Classification report for a Linear Support Vector Machine classifier:

[Image: classification_report]

Distribution of Top 10 Categories by Genre

[Images: top10cat_direct_, top10cat_social_, top10cat_news_]

Note

The dataset has highly imbalanced classes, i.e. the classes are unequally represented. This affects the ML algorithms: because most instances belong to the majority class, the algorithms tend to assign new observations to that class as well.
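
A small worked example makes the effect concrete. The sketch below scores a "classifier" that always predicts the majority class; the 95/5 split is made up for illustration and is not a statistic of this dataset.

```python
from collections import Counter


def majority_baseline_metrics(labels):
    """Score a baseline that always predicts the most frequent label."""
    majority = Counter(labels).most_common(1)[0][0]
    predictions = [majority] * len(labels)

    correct = sum(p == t for p, t in zip(predictions, labels))
    accuracy = correct / len(labels)

    # Recall on everything that is NOT the majority class.
    minority_total = sum(t != majority for t in labels)
    minority_hits = sum(p == t for p, t in zip(predictions, labels) if t != majority)
    minority_recall = minority_hits / minority_total if minority_total else 0.0
    return accuracy, minority_recall


# Invented label distribution: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
accuracy, minority_recall = majority_baseline_metrics(labels)
print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

The baseline reaches 95% accuracy without ever finding a positive case, which is why precision, recall and F1 per category are more informative than accuracy here.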

Possible approaches to address imbalanced data are:

  • Boosting the predictive performance on the minority class, using recognition-based learning or cost-sensitive learning.
  • Resampling the data (over-sampling, under-sampling, SMOTE).
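
Two of the approaches above can be sketched in plain Python: random over-sampling of the minority class, and class weights for cost-sensitive learning. This is an illustration on a single binary label, not code from this project. The function names are hypothetical, and the weight formula is the "balanced" heuristic, n_samples / (n_classes * class_count).

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw with replacement to fill the gap to the largest class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Invented data: 8 majority samples, 2 minority samples.
samples = [f"message {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

resampled_samples, resampled_labels = random_oversample(samples, labels)
print(Counter(resampled_labels))       # both classes now have 8 samples
print(balanced_class_weights(labels))  # {0: 0.625, 1: 2.5}
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution. In scikit-learn, many estimators (for example `LinearSVC`) accept `class_weight="balanced"`, which applies the same weighting heuristic.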

Licensing, Authors, Acknowledgments

This project has been completed as part of the Data Science Nanodegree on Udacity. The data was collected by Figure Eight and provided by Udacity.