UNDP WG 3 SDG Classification

Introduction

This project, in partnership with SDG AI LAB, aims to use supervised learning models to classify text and articles into the 17 SDGs.

Libraries Used

Data Collection: Tweepy, pygoogle, GPT-2, GPT-Neo
Data Visualization: PyCaret, LDA, WordCloud
Modeling: scikit-learn Logistic Regression, Word2Vec, Hugging Face, PyTorch

Data Collection

Steps for collecting the data:

  • GPT-2/GPT-Neo/XLNet sentence generation using Hugging Face (see the sketch after this list)

  • Twitter API via Tweepy to collect tweets by hashtag (#) and mention (@) using a keyword list from the ontology

  • Google News scrape using pygoogle to collect news articles on the 17 SDGs from 2018 to 2020
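
For illustration, here is a minimal sketch of the sentence-generation step using the Hugging Face text-generation pipeline; the checkpoint, prompt, and generation settings are assumptions rather than the project's exact configuration.

# Illustrative sketch of synthetic-sentence generation with the Hugging Face
# text-generation pipeline. The checkpoint, prompt, and settings are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Seed the generator with an SDG-related keyword phrase (hypothetical prompt).
prompt = "Access to clean water and sanitation"
outputs = generator(prompt, max_length=60, num_return_sequences=3, do_sample=True)

for out in outputs:
    print(out["generated_text"])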

Data Visualization and Findings

Data Distribution

Distribution of collected tweets across the 17 SDGs

Text Preprocessing:

TFIDF

def clean_tweet(df):
    """Clean a pandas Series of tweets: expand contractions, strip URLs,
    @-mentions, HTML ampersands, and punctuation, then tokenize and lemmatize.
    """
    import re
    # lemm and tknzr were notebook-level globals; NLTK's WordNetLemmatizer and
    # TweetTokenizer are assumed here to match the names used in the notebook.
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import TweetTokenizer
    lemm = WordNetLemmatizer()
    tknzr = TweetTokenizer()

    df = df.copy()

    # expand contractions (decontracted is a helper defined elsewhere in the notebook)
    df = df.apply(decontracted)

    # clean up URLs, @-mentions, &amp entities, and punctuation
    df = df.apply(lambda x: re.sub(r"http\S+", "", x.lower()))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp", "", x))\
           .apply(lambda x: re.sub(r"&", "", x))\
           .apply(lambda x: re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', x))

    # tokenize, lemmatize each token, and rejoin into a cleaned string
    df = df.apply(lambda x: tknzr.tokenize(x))
    df = df.apply(lambda toks: ' '.join(lemm.lemmatize(t) for t in toks))
    return df
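
A minimal usage sketch showing how the cleaned tweets can feed scikit-learn's TfidfVectorizer; the sample tweets and vectorizer parameters are illustrative, and clean_tweet (with its decontracted helper) is assumed to be defined as above.

# Illustrative only: vectorize the cleaned tweets with TfidfVectorizer.
# Sample data and parameters are assumptions, not the project's settings.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = pd.Series([
    "Clean water and sanitation for all communities!",
    "Renewable energy investment is growing @energy #SDG7",
])
cleaned = clean_tweet(tweets)  # helper defined above (needs decontracted)

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(cleaned)
print(X.shape)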

Word2Vec

def clean_text(df):
    """Clean a pandas Series of documents for Word2Vec: expand contractions,
    strip URLs, @-mentions, and ampersands, lowercase, and drop punctuation.
    """
    import re

    df = df.copy().reset_index(drop=True)

    # expand contractions (decontracted is a helper defined elsewhere in the notebook)
    df = df.apply(decontracted)

    # clean up URLs, @-mentions, &amp entities, backslashes, underscores, and punctuation
    df = df.apply(lambda x: re.sub(r"http\S+", "", x))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp", "", x))\
           .apply(lambda x: re.sub(r"&", "", x))\
           .apply(lambda x: str(x).lower().replace('\\', '').replace('_', ' '))\
           .apply(lambda x: re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', x))

    return df
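
For context, a minimal sketch of training Word2Vec on the cleaned text with gensim; vector_size, window, and min_count are illustrative choices, not the project's configuration.

# Illustrative Word2Vec training on cleaned text with gensim.
# vector_size, window, and min_count are assumptions.
import pandas as pd
from gensim.models import Word2Vec

docs = pd.Series([
    "Affordable and clean energy drives sustainable growth",
    "Quality education reduces inequality",
])
cleaned = clean_text(docs)                 # helper defined above (needs decontracted)
sentences = [doc.split() for doc in cleaned]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(w2v.wv["energy"].shape)              # a 100-dimensional word vector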

BERT

def clean_text(df):
    """Light cleaning for BERT: strip URLs, @-mentions, and ampersands only,
    leaving case and punctuation intact for the cased tokenizer.
    """
    import re

    df = df.copy().reset_index(drop=True)
    df = df.apply(lambda x: re.sub(r"http\S+", "", x))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp", "", x))\
           .apply(lambda x: re.sub(r"&", "", x))
    return df

LDA and Word Cloud:

LDA Topic Distribution across the 17 SDGs

Unigram Word Count

Bigram Word Count

Trigram Word Count

K-means Word Cloud for the 17 SDGs

Industry, Innovation and Infrastructure (SDG 9) Word Cloud

Responsible Consumption and Production (SDG 12) Word Cloud
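
For reference, a generic sketch of LDA topic modeling and word-cloud generation using scikit-learn and the wordcloud package; the project lists PyCaret for its LDA, so treat this only as an analogous example on toy data.

# Generic LDA + word cloud sketch (scikit-learn and the wordcloud package);
# the project uses PyCaret's LDA, so this is only an analogous example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from wordcloud import WordCloud

docs = [
    "clean water sanitation access rural communities",
    "renewable energy solar wind investment",
    "gender equality education girls schools",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

# 17 topics would mirror the 17 SDGs; 3 here keeps the toy example small.
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(counts)

# Print the top words per topic
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {top}")

# Word cloud over the whole toy corpus
wc = WordCloud(width=800, height=400).generate(" ".join(docs))
wc.to_file("wordcloud.png")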

Modeling

Log Reg TFIDF:
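
A minimal sketch of a TF-IDF + logistic regression pipeline, assuming df is a labeled DataFrame with hypothetical text and sdg columns; all parameters are illustrative.

# Sketch of a TF-IDF + logistic regression pipeline with scikit-learn.
# df, its "text"/"sdg" columns, and all parameters are assumptions.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sdg"], test_size=0.2, random_state=42, stratify=df["sdg"]
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))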

Log Reg Word2Vec:
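
Similarly, a sketch of using averaged Word2Vec vectors as document features for logistic regression; w2v, cleaned_train_texts, and train_labels are hypothetical names carried over from the earlier sketches.

# Sketch: mean Word2Vec vectors per document as features for logistic regression.
# w2v, cleaned_train_texts, and train_labels are hypothetical names.
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(doc, w2v):
    """Average the vectors of the tokens that appear in the Word2Vec vocabulary."""
    vecs = [w2v.wv[t] for t in doc.split() if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(doc, w2v) for doc in cleaned_train_texts])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)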

BERT Transfer Learning with Hugging Face:

Token distribution: 100 tokens were used for the tweet BERT model, and a cased model was chosen since "HI!!!" and "hi." can carry different meanings.
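
A minimal sketch of the cased, 100-token setup with Hugging Face Transformers; the bert-base-cased checkpoint and the 17-label head are assumptions consistent with the description above, and fine-tuning details are omitted.

# Sketch of cased BERT encoding (100 tokens) and a 17-class head with
# Hugging Face Transformers. The checkpoint and inference-only usage are assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=17)

texts = ["Access to clean water is a basic human right"]
enc = tokenizer(texts, padding="max_length", truncation=True, max_length=100,
                return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits
probs = torch.softmax(logits, dim=-1)      # prediction probabilities per SDG class
print(probs.argmax(dim=-1))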

Results

Log Reg TFIDF:

Log Reg Word2Vec:

BERT Transfer Learning with Hugging Face:

BERT SDG Confusion Matrix

Tweet:

You have to agree with The on this, I DON'T agree on EVERYTHING the
GOP has to say much ANYMORE. But I have to admit that condemning China
for what they're currently going would've been a lot BETTER than what
he said that day.

True SDG: 14

Prediction Proba:

Tweet:

Yeah: one might even suggest that that the full policy consequences of
moral panics about wokeness or cancel culture in elite private spaces
like Oberlin or Harvard are visited upon public higher ed in red state
institutions

True SDG: 0

Prediction Proba:

Author