This project, in partnership with SDG AI LAB, uses supervised learning models to classify text/articles into the 17 SDGs.
| Step | Tools |
| --- | --- |
| Data Collection | Tweepy, Pygoogle, GPT2, GPTNEO |
| Data Visualization | PyCaret LDA, WordCloud |
| Modeling | Sklearn Logistic Regression, Word2Vec, Hugging Face, Pytorch |
Steps for Collecting the Data:
- GPT2/Neo/XLNet sentence generation using Hugging Face
- Twitter API via Tweepy, collecting tweets by hashtag/@mention from a keyword list built from the ontology (see the sketch after this list)
- Google News scrape using pygoogle, collecting news on the 17 SDGs from 2018-2020
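A minimal sketch of the tweet-collection step, assuming Tweepy v4 OAuth 1.0a credentials and a hypothetical `sdg_keywords` list standing in for the ontology keywords:

```python
import tweepy

# Hypothetical credentials -- substitute your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Stand-in for the keyword list built from the SDG ontology.
sdg_keywords = ["no poverty", "climate action", "gender equality"]

tweets = []
for kw in sdg_keywords:
    # Cursor pages through search results; tweet_mode="extended"
    # returns the full, untruncated tweet text.
    for status in tweepy.Cursor(api.search_tweets, q=kw, lang="en",
                                tweet_mode="extended").items(100):
        tweets.append(status.full_text)
```

Note that `search_tweets` is the Tweepy v4 name of the endpoint; in v3 it was `api.search`.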
Data Distribution:
Tweet distribution across the collected SDGs
Text Preprocessing:
TFIDF
```python
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

lemm = WordNetLemmatizer()
tknzr = TweetTokenizer()

def clean_tweet(df):
    """Copy the Series, then strip URLs, @mentions, HTML ampersands
    (&amp;), and punctuation; finally tokenize and lemmatize each token.
    Assumes `decontracted` (a contraction expander) is defined elsewhere
    in the repo.
    """
    df = df.copy()
    # expand contractions, e.g. "don't" -> "do not"
    df = df.apply(decontracted)
    # lowercase, then remove URLs, @mentions, &amp;, and punctuation
    df = df.apply(lambda x: re.sub(r"http\S+", "", x.lower()))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp;", "", x))\
           .apply(lambda x: re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', x))
    # tokenize, lemmatize each token, and rejoin into one cleaned string
    df = df.apply(lambda x: [lemm.lemmatize(t) for t in tknzr.tokenize(x)])
    df = df.apply(lambda x: " ".join(x))
    return df
```
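The cleaned text then feeds the TF-IDF logistic regression reported below; a minimal, runnable sketch with toy data (the vectorizer settings are assumptions, not the repo's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the cleaned tweets and their SDG labels.
X_train = ["end poverty in all its forms everywhere",
           "ensure access to clean water and sanitation"]
y_train = [1, 6]

tfidf_logreg = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
tfidf_logreg.fit(X_train, y_train)
print(tfidf_logreg.predict(["clean drinking water in rural areas"]))
```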
Word2Vec
```python
import re

def clean_text(df):
    """Cleaning for the Word2Vec pipeline: expand contractions, strip
    URLs, @mentions, and &amp;, lowercase, and remove punctuation.
    Assumes `decontracted` is defined elsewhere in the repo.
    """
    df = df.copy().reset_index(drop=True)
    # expand contractions
    df = df.apply(decontracted)
    # remove URLs, @mentions, &amp;; lowercase and strip punctuation
    df = df.apply(lambda x: re.sub(r"http\S+", "", x))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp;", "", x))\
           .apply(lambda x: str(x).lower().replace("\\", "").replace("_", " "))\
           .apply(lambda x: re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', x))
    return df
```
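For the Word2Vec features, a common recipe is to train Gensim embeddings on the tokenized corpus and average word vectors per tweet before the logistic regression; a sketch under those assumptions (the vector size and averaging scheme are illustrative, not the repo's exact setup):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for the cleaned, tokenized tweets.
sentences = [["end", "poverty"], ["clean", "water", "access"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

def tweet_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# One dense feature row per tweet, ready for the logistic regression.
X = np.vstack([tweet_vector(s, w2v) for s in sentences])
```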
BERT
```python
import re

def clean_text(df):
    """Minimal cleaning for BERT: keep case and punctuation (the cased
    tokenizer uses both) and strip only URLs, @mentions, and &amp;.
    """
    df = df.copy().reset_index(drop=True)
    df = df.apply(lambda x: re.sub(r"http\S+", "", x))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp;", "", x))
    return df
```
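The lightly cleaned text goes straight to the Hugging Face tokenizer; a minimal encoding sketch, where bert-base-cased and num_labels=18 (17 SDGs plus a class 0) are assumptions consistent with the token-length and casing notes below:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# num_labels=18 is assumed for illustration: 17 SDGs plus a class 0.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=18)

# Pad/truncate every tweet to 100 tokens, matching the setup below.
batch = tokenizer(["Clean water is a human right."], padding="max_length",
                  truncation=True, max_length=100, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
```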
LDA and Word Cloud:
- LDA topic distribution across the 17 SDGs
- Unigram word count
- Bigram word count
- Trigram word count
- KMeans word cloud for the 17 SDGs
- SDG 9 (Industry, Innovation and Infrastructure) word cloud (see the sketch below)
- SDG 12 (Responsible Consumption and Production) word cloud
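The word clouds come from the wordcloud package; a minimal sketch, with `sdg9_text` a hypothetical stand-in for all cleaned tweets labeled SDG 9 joined into one string:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Stand-in for the concatenated cleaned tweets labeled SDG 9.
sdg9_text = "infrastructure industry innovation resilient manufacturing"

wc = WordCloud(width=800, height=400,
               background_color="white").generate(sdg9_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```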
Log Reg TFIDF:
Log Reg Word2Vec:
BERT Transfer Learning with Hugging Face:
Token Distribution: 100 tokens were used for the tweet BERT model, with the cased model, since "HI!!!" and "hi." can carry different meanings (see the tokenizer sketch below).
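A quick sketch of why the cased model matters, using the Hugging Face tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# The cased vocabulary keeps capitalization and punctuation distinct,
# so these two greetings tokenize differently.
print(tokenizer.tokenize("HI!!!"))
print(tokenizer.tokenize("hi."))
```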
BERT SDG Confusion Matrix
Tweet:
> You have to agree with The on this, I DON'T agree on EVERYTHING the GOP has to say much ANYMORE. But I have to admit that condemning China for what they're currently going would've been a lot BETTER than what he said that day.

True SDG: 14
Prediction Proba:
Tweet:
> Yeah: one might even suggest that that the full policy consequences of moral panics about wokeness or cancel culture in elite private spaces like Oberlin or Harvard are visited upon public higher ed in red state institutions

True SDG: 0
Prediction Proba:
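The prediction probabilities above come from softmaxing the fine-tuned model's logits; a sketch where "./bert_sdg_model" is a hypothetical path to the saved checkpoint:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical directory holding the fine-tuned checkpoint.
tokenizer = BertTokenizer.from_pretrained("./bert_sdg_model")
model = BertForSequenceClassification.from_pretrained("./bert_sdg_model")
model.eval()

inputs = tokenizer("condemning China would've been a lot BETTER",
                   truncation=True, max_length=100, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs[0].tolist())  # one probability per class
```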
- Justin Huang: into anime, finance, computer vision, and NLP. GitHub: Jvhuang1786