NLP pipline to gather and analyze information regarding the events of Ayotzinapa
Final project for Montclair State University Computational Linguistics program meant to show what I have learned
Download featurefunctions Script Download Process Script requirements.txt
Follow *Full_process above
See the Final paper for this project:
Final dataset : https://drive.google.com/drive/folders/1_INCIZGd0pgqA3TeJ3xLjn-_7jy9ZHup?usp=sharing
Script is meant to speed up NLP work
-
featurefunctions: turns Tweets into linguistic features and preprocess
-
Process: Preprocesses data again
- Ayotzinapa Twitter Dataset: Users gain access to a Twitter Dataset that covers topics surrounding Ayotzinapa from August 2014 to December 2015.Users also gain access to web scraper and Methodology used to collect Ayotzinapa tweets.
- Linguistic feature returns: Users can get count data for part-of-speech representation in each Tweet, overall sentiment scores (see Spanish Sentiment Analyzer model), and article complexity returns (i.e. sentence count, word density).
- Topic Modeling: Users can get a breakdown of major topics in their dataset as well as corresponding keywords for each.
- Ayotzinapa Timeline: Users get list of carefully curated events surrounding Ayotzinapa that can be studied further to see if there is any correlation between the linguist feature data and real life happenings.
- Data: All of the resulting data can be put through data visualization software to see if there are any patters.
Column Name | Data Type | Description |
---|---|---|
text |
str | Primary input for classification model |
dominant_topic |
float | Topic assigned to the Tweet |
topic_percent_contrib |
float | How closely aligned the Tweet is with its assigned topic |
keywords |
str | Associated keywords for a given topic |
noun_count_percent |
float | number of nouns in a given text |
adj_count_percent |
float | number of adjectives in a given text |
adv_count_percent |
float | number of adverbs in a given text |
pro_count_percent |
float | number of pronouns in a given text |
char_count |
float | total characters in a given text |
word_density |
float | average words per sentence |
punctuation_count |
float | punctuation in a given text |
upper_case_word_count |
float | number of uppercase words in a text |
Sentiment |
float | Sentiment score between 1 and 0 |