Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Disaster Response Pipeline Project

Introduction

The project is divided into three parts:

  • ETL
  • Machine Learning Pipeline
  • Web App

Instructions

  1. Run the following commands in the project's root directory to set up your database and model.

    • To run the ETL pipeline that cleans the data and stores it in the database:

      python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db

    • To run the ML pipeline that trains the classifier and saves it:

      python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl

  2. Run the following command in the app's directory to run your web app.

      python run.py

  3. Go to http://0.0.0.0:3001/
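As a rough illustration of what the ETL step in instruction 1 involves, the sketch below joins two CSV files on a shared id column and writes the result to SQLite using only the Python standard library. The column names (id, message, categories), the table name messages and the function name run_etl are assumptions made for illustration; the project's actual data/process_data.py may be organized differently.

```python
import csv
import sqlite3
from contextlib import closing


def run_etl(messages_csv, categories_csv, db_path):
    """Join messages with their categories by id and store them in SQLite."""
    # Assumed columns: "id" and "categories" in the categories file.
    with open(categories_csv, newline="", encoding="utf-8") as f:
        categories = {row["id"]: row["categories"] for row in csv.DictReader(f)}

    # Assumed columns: "id" and "message" in the messages file.
    with open(messages_csv, newline="", encoding="utf-8") as f:
        # Keep only messages that have a category row; the set drops duplicates.
        rows = {
            (row["id"], row["message"], categories[row["id"]])
            for row in csv.DictReader(f)
            if row["id"] in categories
        }

    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS messages")
            conn.execute(
                "CREATE TABLE messages (id TEXT, message TEXT, categories TEXT)"
            )
            conn.executemany(
                "INSERT INTO messages VALUES (?, ?, ?)", sorted(rows)
            )
    return len(rows)
```

Calling run_etl with the three paths from the command above would return the number of rows written, provided the files use the assumed column names.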

Deploying to AWS Elastic Beanstalk

Create a new EC2 key pair in EC2 management console > Network & Security > Key Pairs and download the .pem file.

Change the permissions of the .pem file. Your private key file must NOT be accessible by other users.

  • Remove access for everyone else
  • Add your own user (edit, read/run, read, write)
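On Linux or macOS, the permission change above roughly corresponds to making the key readable by its owner only (the equivalent of chmod 400). The following is a minimal sketch assuming a POSIX system; the function name restrict_key_permissions is hypothetical.

```python
import os
import stat


def restrict_key_permissions(pem_path):
    """Make a private key file readable by its owner only (like chmod 400)."""
    os.chmod(pem_path, stat.S_IRUSR)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(pem_path).st_mode)
```

On a POSIX system, calling it with the path of the downloaded .pem file should return 0o400, meaning no group or other user can read the key.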

Create a new web server environment using a preconfigured Python platform.

Click Configure more options, then modify the Security settings.

Under EC2 key pair, choose the key pair you created.

Before deploying the app to AWS Elastic Beanstalk, follow these guidelines:

  • Using application.py as the filename and providing a callable application object (the Flask object, in this case) allows Elastic Beanstalk to find your application's code easily.
  • The Flask object within application.py must be named application. If app is assigned as a reference to application (app = application), existing uses of app do not need to be renamed.
  • Alternatively, .ebextensions/python.config can be used to specify the file that contains the WSGI application. In this way, there is no need to change the application filename to application.py:
option_settings:
  aws:elasticbeanstalk:container:python:
    WSGIPath: app/run.py

We also have to add the container command AddGlobalWSGIGroupAccess, which appends WSGIApplicationGroup %{GLOBAL} to wsgi.conf, to solve the issue discussed in the article "Deploying SciPy into AWS Elastic Beanstalk".

container_commands:
  AddGlobalWSGIGroupAccess:
    command: "if ! grep -q 'WSGIApplicationGroup %{GLOBAL}' ../wsgi.conf ; then echo 'WSGIApplicationGroup %{GLOBAL}' >> ../wsgi.conf; fi;"

When starting the app, you may see the error below because NLTK has not been installed and the corresponding data files have not been downloaded.

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/home/wsgi/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/opt/python/run/venv/nltk_data'
    - '/opt/python/run/venv/lib/nltk_data'
    - ''
**********************************************************************

To install NLTK, we have to SSH into the EC2 instance created by Elastic Beanstalk.

Connect to the EC2 instance using ssh:

ssh -i "<Your pem file>" ec2-user@<Your EC2 instance>

If the connection succeeds, you will see the message below.

 _____ _           _   _      ____                       _        _ _
| ____| | __ _ ___| |_(_) ___| __ )  ___  __ _ _ __  ___| |_ __ _| | | __
|  _| | |/ _` / __| __| |/ __|  _ \ / _ \/ _` | '_ \/ __| __/ _` | | |/ /
| |___| | (_| \__ \ |_| | (__| |_) |  __/ (_| | | | \__ \ || (_| | |   <
|_____|_|\__,_|___/\__|_|\___|____/ \___|\__,_|_| |_|___/\__\__,_|_|_|\_\
                                       Amazon Linux AMI

This EC2 instance is managed by AWS Elastic Beanstalk. Changes made via SSH
WILL BE LOST if the instance is replaced by auto-scaling. For more information
on customizing your Elastic Beanstalk environment, see our documentation here:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html

Install NLTK.

sudo pip install -U nltk

Download the NLTK data to the directory /usr/local/share/nltk_data.

sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

After downloading the NLTK data, the app should be able to start and run.
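Since the login banner warns that changes made over SSH are lost when the instance is replaced, one possible alternative is to check for the resource when the app starts and download it only if it is missing. The sketch below assumes NLTK is installed in the app's environment (for example via requirements.txt) and that the process can write to the download location; the function name ensure_nltk_resource is hypothetical and is not part of this project.

```python
def ensure_nltk_resource(resource_path="tokenizers/punkt",
                         package="punkt",
                         download_dir=None):
    """Download an NLTK package only if its resource cannot be found.

    Returns True if a download was triggered, False otherwise.
    """
    import nltk  # imported lazily so this module loads even without NLTK

    try:
        nltk.data.find(resource_path)
        return False  # resource already available, nothing to do
    except LookupError:
        # download_dir=None lets NLTK choose its default data directory.
        nltk.download(package, download_dir=download_dir)
        return True
```

Calling ensure_nltk_resource() before the tokenizer is first used would avoid the LookupError shown earlier. Passing download_dir="/usr/local/share/nltk_data" would mirror the manual step above, but that directory may not be writable by the web server user.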

If it still fails, view the logs by navigating to Logs and choosing Request Logs > Last 100 Lines.

Screenshots

Main Screen

Distribution of Categories (Pie Chart)

Distribution of Categories (Top 10 and Last 10)

Message Classification (With Result)

Message Classification (No Result)

Discussions

Of the categories that the ML algorithm classifies messages into, the top three are Aid Related, Weather Related and Direct Report. Messages should therefore be sent to organizations that can provide medical aid and evacuate people from extreme weather conditions.

This dataset is imbalanced (i.e., some labels, such as water, have few examples). For labels with so few examples, there is not enough training data to build a good model.

Take the label water as an example:

             precision    recall  f1-score   support

          0       0.97      0.99      0.98      4932
          1       0.84      0.44      0.58       312

avg / total       0.96      0.96      0.96      5244

The recall for the positive class is quite low (0.44). Low recall means that many actual positive messages are classified as negative (false negatives), so a lot of disaster information may be missed. It is therefore better to improve recall so that the number of false negatives is reduced.
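To make this concrete, the approximate counts behind the positive-class row can be recovered from the reported precision, recall and support. This is a small worked sketch; because the reported values are rounded to two decimals, the counts are estimates, and the function names are chosen here for illustration.

```python
def confusion_counts(precision, recall, support):
    """Estimate TP, FN and FP for one class from rounded report values."""
    true_pos = round(recall * support)  # recall = TP / (TP + FN)
    false_neg = support - true_pos
    # precision = TP / (TP + FP), so TP + FP = TP / precision
    false_pos = round(true_pos / precision) - true_pos
    return true_pos, false_neg, false_pos


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


tp, fn, fp = confusion_counts(precision=0.84, recall=0.44, support=312)
print(tp, fn, fp)                      # 137 175 26
print(round(f1_score(0.84, 0.44), 2))  # 0.58
```

Under these estimates, roughly 175 of the 312 water-related messages in the test set are missed, which is why recall matters more than the 0.96 average suggests.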