SIH2K22

Our entry for the NDRF's problem statement GS900 in the Smart India Hackathon 2022, where we finished in a close second place.



Team Members:

Gautham Prabhu (Team Leader), Anurag Chowdhury, Soumya A R, Anshita Palorkar, Tanay Gupta, MV Srujan

We represented our college, Manipal Institute of Technology, in the hackathon.

Abstract

VIKAS: a real-time, multimodal solution linking disaster victims and first responders from the NDRF, streamlining support to the most vulnerable.

In the event of a disaster, many people turn to social media to seek support, both material and mental. The data from these posts helps increase situational awareness as quickly as possible. Text, images, videos, and audio extracted in real time from these posts play a crucial role in identifying appropriate emergency responses to a particular disaster. Once irrelevant information is filtered out, deep-learning-based classification, object detection, and natural language processing methods are used to expedite emergency response decision-making. Easy-to-interpret visualizations provide details that further facilitate the distribution of resources and the dispatch of required personnel to affected areas.


Our solution involves the following components:

  1. Data extraction: real-time extraction of raw data
  2. Analysis of the extracted data
  3. Visualization of the data

Data Extraction

Our project mainly deals with data that is available from tweets. This generally comprises the text and images extracted from the respective tweets.

Text extraction

Our solution uses the Twitter API to access tweets in real time during the occurrence of a disaster. The Twitter API can be used to programmatically retrieve and analyze Twitter data, as well as build applications on top of the conversation happening on Twitter.

Tweepy is an easy-to-use Python library for accessing the Twitter API.

The easiest way to install the latest version from PyPI is by using pip:

pip install tweepy

Using the API, a keyword and a stream time are entered, and data is streamed in real time into a database. This data may include the text content and geolocation of the tweets, along with links to images and videos.
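Each streamed tweet arrives as a JSON payload. Below is a minimal sketch of flattening one into the fields we store; the field names follow the Twitter API v1.1 tweet schema, and the `parse_tweet` helper is our own illustration rather than the project's exact code:

```python
import json

def parse_tweet(raw: str) -> dict:
    """Flatten a streamed tweet JSON payload into the fields we keep."""
    tweet = json.loads(raw)
    media = tweet.get("extended_entities", {}).get("media", [])
    return {
        "text": tweet.get("text", ""),
        "geo": tweet.get("coordinates"),  # None unless the user geotagged the tweet
        "media_urls": [m.get("media_url_https") for m in media],
    }

sample = ('{"text": "Flood near the river bank", "coordinates": null, '
          '"extended_entities": {"media": '
          '[{"media_url_https": "https://pbs.twimg.com/x.jpg"}]}}')
record = parse_tweet(sample)
```

Records in this shape can then be written to whatever database backend the stream is pointed at.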

For cleaning and preprocessing the data, our pipeline does the following:

  • Removes non-alphanumeric characters (e.g. punctuation)
  • Removes emoticons, URLs, emails, and 'RT' markers using regular expressions
  • Places duplicated tweets in a separate pandas DataFrame
  • Filters out irrelevant tweets using natural language processing
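The first three cleaning steps above can be sketched as follows (a minimal, stdlib-only version; the project itself keeps duplicates in a pandas DataFrame and does the relevance filtering with NLP models):

```python
import re

def clean_tweet(text: str) -> str:
    """Drop RT markers, URLs, emails, then all non-alphanumeric characters."""
    text = re.sub(r"\bRT\b", " ", text)          # retweet marker
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"\S+@\S+", " ", text)         # emails
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # punctuation, emoticons
    return re.sub(r"\s+", " ", text).strip()

def split_duplicates(tweets):
    """Return (unique, duplicated); duplicates would go to a separate DataFrame."""
    seen, unique, dupes = set(), [], []
    for t in tweets:
        (dupes if t in seen else unique).append(t)
        seen.add(t)
    return unique, dupes

cleaned = clean_tweet("RT @x: Flood!! see https://t.co/abc")
unique, dupes = split_duplicates(["a", "b", "a"])
```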

Tweets filtered for the keyword 'Flood'


We also developed a word cloud that generates a collection of words associated with the disaster. These words are sized according to the frequency of their usage and their relevance to the disaster.

To generate the word cloud, first install the Python library with the following command:

pip install wordcloud
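The word cloud is driven by a word-frequency map built over the cleaned tweets. A minimal sketch of that step (the two sample tweets are made up; the commented rendering call uses the `wordcloud` package's `generate_from_frequencies` API):

```python
from collections import Counter

def word_frequencies(cleaned_tweets):
    """Count word usage across cleaned tweets; this frequency map is what
    the word cloud is rendered from."""
    counts = Counter()
    for tweet in cleaned_tweets:
        counts.update(tweet.lower().split())
    return counts

freqs = word_frequencies(["flood relief needed", "flood waters rising"])

# To render (requires the wordcloud package installed above):
#   from wordcloud import WordCloud
#   WordCloud(width=800, height=400).generate_from_frequencies(freqs).to_file("cloud.png")
```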

An illustration of the word cloud for the tweets relevant to Japan bombings

For determining the relevance of a tweet, we used the CrisisNLP dataset. We use a BERT-based model to analyze this text, as described in the next section.

Image extraction

Images are extracted from the image links present in the tweets after the relevant ones have been filtered. These images are then analysed using computer vision models.

Analysis of extracted data

Analysing text

Tweets and text posts often contain crucial information about the locations affected by a particular disaster and the amount of resources required. Hence, after extracting the text, we create word embeddings. These embeddings are then classified as disaster-related or not.
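Before any embedding lookup, the cleaned text has to be mapped to integer token IDs; those IDs index rows of the embedding matrix that the classifier (BERT, or the LSTM below) consumes. A simplified illustration, with a toy whitespace tokenizer and a hypothetical two-tweet corpus (real BERT tokenization uses WordPiece, not word splitting):

```python
def build_vocab(corpus):
    """Map each distinct word to an integer ID (0 is reserved for padding)."""
    vocab = {"<pad>": 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, max_len=6):
    """Integer-encode and pad; unknown words fall back to the pad ID here."""
    ids = [vocab.get(w, 0) for w in sentence.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["flood in the city", "need rescue boats"])
vec = encode("flood rescue", vocab)
```

Each ID in `vec` would then select an embedding vector, and the resulting sequence is what gets classified as disaster-related or not.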





We also built an LSTM-based RNN model which helps us obtain important statistics about a particular disaster. These statistics often include important landmarks and locations, which we can represent on a map.
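To show the shape of the landmark-extraction output, here is a deliberately simplified stand-in: a hypothetical gazetteer lookup in place of the trained LSTM (the landmark list and function name are illustrative only; the real model learns locations rather than matching a fixed list):

```python
# Hypothetical gazetteer -- illustrative only, not the trained model's knowledge.
LANDMARKS = {"marine drive", "howrah bridge", "gateway of india"}

def find_landmarks(text: str):
    """Return landmarks mentioned in a tweet, ready for plotting on the map."""
    lowered = text.lower()
    return sorted(name for name in LANDMARKS if name in lowered)

hits = find_landmarks("Water levels rising near Howrah Bridge and Marine Drive")
```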

Demo: https://view-awesome-table.com/-NAOX2coHuKs-_YWPfhy/view



Analysing audio

Speech recognition is done using the Google Speech API. Audio is extracted from videos and converted into text using speech-to-text. The text is then analysed using the models mentioned above.

Analysing photos and videos

We use CNN-based classification and object detection models to classify images and detect disaster-related labels.

We first classify images as relevant or irrelevant depending on the disaster. For this we fine-tune an existing model, ResNet50, using fastai. The relevant images are then further classified by severity and type of disaster.

Deep learning models are used for this classification. Types of damage include fire damage, natural damage, infrastructure damage, and flood damage; severity ranges from mild to severe.
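Once the classifiers have run, their outputs reduce to a damage-type and severity label per image. A schematic of that last step (the logit values are made up for illustration; in practice they come from the CNN heads, and the intermediate "moderate" severity is our assumption):

```python
DAMAGE_TYPES = ["fire damage", "natural damage", "infrastructure damage", "flood damage"]
SEVERITIES = ["mild", "moderate", "severe"]

def decode_prediction(type_logits, severity_logits):
    """Pick the highest-scoring damage type and severity from model outputs."""
    dmg = DAMAGE_TYPES[max(range(len(type_logits)), key=type_logits.__getitem__)]
    sev = SEVERITIES[max(range(len(severity_logits)), key=severity_logits.__getitem__)]
    return dmg, sev

label = decode_prediction([0.1, 0.2, 0.1, 0.9], [0.3, 0.2, 0.8])
```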





Visualizations

Visualizations are available at the links below. They describe the data made available after applying the ML techniques described above.

https://www.figma.com/file/QMKn8FxcbEtSY5KKajtXLF/Vikas-Dashboard?node-id=301%3A2872 (English)
https://www.figma.com/file/yGQMVaYngLebKpmQoXTl8y/Vikas-Dashboard-(Hindi)?node-id=0%3A1 (Hindi)

Future scope and limitations

Language

More research into Indian language processing, pre-trained models, and corpora collection will help us expand our project to a diverse range of localities.

More Data Sources

Higher-level access to public APIs, and access to APIs that are not currently public (e.g. Facebook, Koo), could provide more information to improve our accuracy, strengthen redundancy checks, and account for outliers.

Location Data

With better documentation of local landmarks, we can refine the search space and improve map visualisation.

Cloud Computing Resources

MLaaS and PaaS offerings would increase processing power, reduce model training time, and make storage management and updates easier.