This project, in partnership with SDG AI LAB, uses supervised learning models to classify text/articles into the 17 SDGs.
| Step | Tools |
| --- | --- |
| Data Collection | Tweepy, Pygoogle, GPT2, GPTNEO |
| Data Visualization | PyCaret LDA, WordCloud |
| Modeling | Sklearn Logistic Regression, Word2Vec, Hugging Face, Pytorch |
Steps for Collecting the Data:
- GPT2/Neo/XLNet sentence generation using Hugging Face
- Twitter API via Tweepy, collecting tweets by hashtag/@mention from a keyword list built from the ontology (see the sketch after this list)
- Google News scrape using pygoogle, collecting news on the 17 SDGs from 2018-2020
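A minimal sketch of the tweet-collection step, assuming Tweepy v4 OAuth 1.0a credentials and a hypothetical `sdg_keywords` list standing in for the ontology keywords:

```python
import tweepy

# Hypothetical credentials -- substitute your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Stand-in for the keyword list built from the SDG ontology.
sdg_keywords = ["no poverty", "climate action", "gender equality"]

tweets = []
for kw in sdg_keywords:
    # Cursor pages through search results; tweet_mode="extended"
    # returns the full, untruncated tweet text.
    for status in tweepy.Cursor(api.search_tweets, q=kw, lang="en",
                                tweet_mode="extended").items(100):
        tweets.append(status.full_text)
```

Note that `search_tweets` is the Tweepy v4 name of the endpoint; in v3 it was `api.search`.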
Data Distribution:
Tweet distribution across the collected SDGs
Text Preprocessing:
TFIDF
```python
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

lemm = WordNetLemmatizer()
tknzr = TweetTokenizer()

def clean_tweet(df):
    """Copy the Series, then strip URLs, @mentions, HTML ampersands
    (&amp;), and punctuation; finally tokenize and lemmatize each token.
    Assumes `decontracted` (a contraction expander) is defined elsewhere
    in the repo.
    """
    df = df.copy()
    # expand contractions, e.g. "don't" -> "do not"
    df = df.apply(decontracted)
    # lowercase, then remove URLs, @mentions, &amp;, and punctuation
    df = df.apply(lambda x: re.sub(r"http\S+", "", x.lower()))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp;", "", x))\
           .apply(lambda x: re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', x))
    # tokenize, lemmatize each token, and rejoin into one cleaned string
    df = df.apply(lambda x: [lemm.lemmatize(t) for t in tknzr.tokenize(x)])
    df = df.apply(lambda x: " ".join(x))
    return df
```
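The cleaned text then feeds the TF-IDF logistic regression reported below; a minimal, runnable sketch with toy data (the vectorizer settings are assumptions, not the repo's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the cleaned tweets and their SDG labels.
X_train = ["end poverty in all its forms everywhere",
           "ensure access to clean water and sanitation"]
y_train = [1, 6]

tfidf_logreg = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
tfidf_logreg.fit(X_train, y_train)
print(tfidf_logreg.predict(["clean drinking water in rural areas"]))
```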
Word2Vec
```python
import re

def clean_text(df):
    """Cleaning for the Word2Vec pipeline: expand contractions, strip
    URLs, @mentions, and &amp;, lowercase, and remove punctuation.
    Assumes `decontracted` is defined elsewhere in the repo.
    """
    df = df.copy().reset_index(drop=True)
    # expand contractions
    df = df.apply(decontracted)
    # remove URLs, @mentions, &amp;; lowercase and strip punctuation
    df = df.apply(lambda x: re.sub(r"http\S+", "", x))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp;", "", x))\
           .apply(lambda x: str(x).lower().replace("\\", "").replace("_", " "))\
           .apply(lambda x: re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', x))
    return df
```
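For the Word2Vec features, a common recipe is to train Gensim embeddings on the tokenized corpus and average word vectors per tweet before the logistic regression; a sketch under those assumptions (the vector size and averaging scheme are illustrative, not the repo's exact setup):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus standing in for the cleaned, tokenized tweets.
sentences = [["end", "poverty"], ["clean", "water", "access"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

def tweet_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# One dense feature row per tweet, ready for the logistic regression.
X = np.vstack([tweet_vector(s, w2v) for s in sentences])
```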
BERT
```python
import re

def clean_text(df):
    """Minimal cleaning for BERT: keep case and punctuation (the cased
    tokenizer uses both) and strip only URLs, @mentions, and &amp;.
    """
    df = df.copy().reset_index(drop=True)
    df = df.apply(lambda x: re.sub(r"http\S+", "", x))\
           .apply(lambda i: " ".join(filter(lambda x: x[0] != "@", i.split())))\
           .apply(lambda x: re.sub(r"&amp;", "", x))
    return df
```
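The lightly cleaned text goes straight to the Hugging Face tokenizer; a minimal encoding sketch, where bert-base-cased and num_labels=18 (17 SDGs plus a class 0) are assumptions consistent with the token-length and casing notes below:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# num_labels=18 is assumed for illustration: 17 SDGs plus a class 0.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=18)

# Pad/truncate every tweet to 100 tokens, matching the setup below.
batch = tokenizer(["Clean water is a human right."], padding="max_length",
                  truncation=True, max_length=100, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
```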
LDA and Word Cloud:
- LDA topic distribution across the 17 SDGs
- Unigram word count
- Bigram word count
- Trigram word count
- KMeans word cloud for the 17 SDGs
- SDG 9 (Industry, Innovation and Infrastructure) word cloud (see the sketch below)
- SDG 12 (Responsible Consumption and Production) word cloud
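The word clouds come from the wordcloud package; a minimal sketch, with `sdg9_text` a hypothetical stand-in for all cleaned tweets labeled SDG 9 joined into one string:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Stand-in for the concatenated cleaned tweets labeled SDG 9.
sdg9_text = "infrastructure industry innovation resilient manufacturing"

wc = WordCloud(width=800, height=400,
               background_color="white").generate(sdg9_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```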
Log Reg TFIDF:
Log Reg Word2Vec:
BERT Transfer Learning with Hugging Face:
Token Distribution: 100 tokens were used for the tweet BERT model, with the cased model, since "HI!!!" and "hi." can carry different meanings (see the tokenizer sketch below).
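A quick sketch of why the cased model matters, using the Hugging Face tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# The cased vocabulary keeps capitalization and punctuation distinct,
# so these two greetings tokenize differently.
print(tokenizer.tokenize("HI!!!"))
print(tokenizer.tokenize("hi."))
```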
BERT SDG Confusion Matrix
Tweet:
> You have to agree with The on this, I DON'T agree on EVERYTHING the GOP has to say much ANYMORE. But I have to admit that condemning China for what they're currently going would've been a lot BETTER than what he said that day.

True SDG: 14
Prediction Proba:
Tweet:
> Yeah: one might even suggest that that the full policy consequences of moral panics about wokeness or cancel culture in elite private spaces like Oberlin or Harvard are visited upon public higher ed in red state institutions

True SDG: 0
Prediction Proba:
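The prediction probabilities above come from softmaxing the fine-tuned model's logits; a sketch where "./bert_sdg_model" is a hypothetical path to the saved checkpoint:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical directory holding the fine-tuned checkpoint.
tokenizer = BertTokenizer.from_pretrained("./bert_sdg_model")
model = BertForSequenceClassification.from_pretrained("./bert_sdg_model")
model.eval()

inputs = tokenizer("condemning China would've been a lot BETTER",
                   truncation=True, max_length=100, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs[0].tolist())  # one probability per class
```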
- Justin Huang: into anime, finance, computer vision, and NLP. GitHub: Jvhuang1786