Repository for the first project in the course 'Natural Language Processing' at Northwestern University.
This is the first project for Northwestern's COMP337 - Natural Language Processing class. We are tasked with extracting several pieces of information (such as hosts, awards, nominees, presenters, and winners) about each year's Golden Globes ceremony from more than 170,000 tweets. By default, this project processes tweets that discuss the 2013 Golden Globes ceremony (year=2013), but the code can also be adapted to extract information for other ceremonies.
We mainly summarize our work in three Python files: `gg_api.py`, `util.py`, and `global_var.py`:
- `gg_api.py`: this file is the main program. Call it to generate human-readable output on hosts, awards, nominees, presenters, and winners, and to save the results into a `result_{year}.json` file, where you should specify the year.
- `util.py`: this file stores all the helper functions that we use for extracting the hosts, awards, nominees, etc.
- `global_var.py`: this file stores the global constants that we use for information extraction, such as a list of strings that represent the "ground truth" awards for that ceremony; stopwords used for mining awards, nominees, presenters, etc.; and regular expressions for tweet pre-processing.
Depending on the machine you run this program on, the running time (pre-processing + information extraction) varies between 8 and 15 minutes.
Throughout the design process of this project, we mainly followed a four-step run-time structure:
- Extraction
- Clustering
- Applying Constraints
- Aggregation
Almost all our functions follow the idea of the above four steps:
- Extraction (Pre-processing): Within `pre_ceremony()`, we first pre-process the entire set of tweets. We clean the data using a list of common tweet stopwords to filter out abbreviations and slang. We also remove emojis, punctuation, hashtags, tags, and links.
- Clustering: We wrote helper functions that use regular expressions and keyword matching to cluster each tweet with its most relevant award category. We also apply fuzzy matching, which gives a probability-like score for how well a tweet that contains a name matches the award category (used in the `get_winner(year)` function).
- Applying Constraints: After clustering, we apply a large number of regular expressions and string-matching rules to extract useful information from each tweet. By useful information we mean noun phrases that can correspond to human names, award names, or movie (series) names.
- Aggregation: Lastly, we take the relatively clean extracted data and aggregate it. We merge or discard similar names that may refer to the same person or the same movie. We then look up the results in the IMDb library to get the final list. We output our results in two forms: a human-readable form printed to the console, and a JSON file.
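The four steps above can be illustrated with a minimal sketch. This is not the code in `util.py`: the regular expressions, the function names (`clean`, `cluster`, `candidates`, `aggregate`), and the 0.8 similarity threshold are all assumptions made for illustration, and the fuzzy matching here uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matcher the project actually uses. The IMDb lookup is left out.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical patterns; the project's real ones live in global_var.py.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"[@#]\w+")            # mentions and hashtags
NON_TEXT_RE = re.compile(r"[^A-Za-z\s]")   # emojis, punctuation, digits
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def clean(tweet):
    """Extraction: drop links, tags, emojis, and punctuation."""
    tweet = URL_RE.sub(" ", tweet)
    tweet = TAG_RE.sub(" ", tweet)
    tweet = NON_TEXT_RE.sub(" ", tweet)
    return " ".join(tweet.split())


def cluster(tweets, keywords):
    """Clustering: keep tweets that mention every keyword of one award."""
    return [t for t in tweets if all(k in t.lower() for k in keywords)]


def candidates(tweets):
    """Applying constraints: keep capitalized two-word phrases."""
    return [name for t in tweets for name in NAME_RE.findall(t)]


def aggregate(names, threshold=0.8):
    """Aggregation: merge near-duplicate names, then rank by count."""
    merged = Counter()
    for name, count in Counter(names).most_common():
        for kept in merged:
            ratio = SequenceMatcher(None, name.lower(), kept.lower()).ratio()
            if ratio >= threshold:
                merged[kept] += count  # fold the variant into the kept name
                break
        else:
            merged[name] += count      # no close match: start a new entry
    return merged.most_common()


# Made-up tweets, for illustration only.
sample = [
    "Ben Affleck wins best director for Argo! #GoldenGlobes http://t.co/x",
    "Best director goes to Ben Afleck @goldenglobes",
    "Jennifer Lawrence wins best actress",
]
cleaned = [clean(t) for t in sample]
ranked = aggregate(candidates(cluster(cleaned, ["best", "director"])))
print(ranked)
```

Processing names from most to least frequent means a rare misspelling is folded into the common spelling rather than the other way around. A real run would still need the stopword filtering described above, since capitalized phrases such as award names also match the simple name pattern.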
- We use Python 3.10 for our experiments.
- Please refer to requirements.txt to install the required modules.
- Please download our code as a zip file.
- You can also clone our git repo: `git clone https://github.com/Tizzzzy/CS337_northwestern.git`
- Create a virtual environment if needed and activate it.
- Install all required dependencies: `pip install -r requirements.txt`
- Put the `ggYYYY.json` files in the root directory, such as `gg2013.json` or `gg2015.json`.
- Change `output_dir` to your own directory path in the `best_dress` and `worst_dress` functions.
- In the `main()` of `gg_api.py`, change `year = 2013` to other years if needed. (Note: if running ceremonies other than the Golden Globes, change the string in `process_json(year)` within `util.py` to the filename you have.)
- Within the file `global_var.py`, change the constant `OFFICIAL_AWARDS` to that specific year's ground truth awards.
- Run `gg_api.py` to get the results: `python gg_api.py`. (Note: when running `gg_api.py` alone, you can comment out the lines `df = process_json(year)` and `df['text'] = df['text'].apply(remove_stopwords)` to speed up the run, since we have defined a global variable `df` to store the pre-processed dataframe. We keep these lines for `autograder.py`, which calls each function individually and therefore needs each one to read in and pre-process the json file.)
- Run `autograder.py` to get completeness and spelling scores: `python autograder.py`
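The note about the global `df` describes a load-once, reuse-later pattern. The sketch below shows one way to get that behaviour without commenting lines in and out. It is an assumption-laden illustration, not the project's code: `load_tweets`, `_TWEET_CACHE`, and the `path_template` parameter are hypothetical names, it uses plain lists instead of a pandas dataframe, it assumes each input file is a JSON list of objects with a `text` field, and the lowercase step is only a placeholder for the real pre-processing.

```python
import json

# Hypothetical module-level cache, keyed by year.
_TWEET_CACHE = {}


def load_tweets(year, path_template="gg{year}.json"):
    """Read and pre-process a year's tweets once, then reuse the result."""
    if year not in _TWEET_CACHE:
        path = path_template.format(year=year)
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)  # assumed: a list of {"text": ...} objects
        # Placeholder pre-processing: lowercase and collapse whitespace.
        _TWEET_CACHE[year] = [
            " ".join(tweet["text"].lower().split()) for tweet in raw
        ]
    return _TWEET_CACHE[year]
```

With a helper like this, every extraction function can call `load_tweets(year)` itself: the first call pays the file-reading cost and later calls in the same process return the cached list.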
The code will produce two result files. First, it creates a `results.json` file, which contains the results for the autograder. `gg_api.py` will automatically read the content of that file and feed it into the autograder. Second, it creates a `results.md` file, which is human readable and contains the same results. We also added some visualizations to that file that show the results for our additional goals.
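As a rough illustration of producing both files from one results object, here is a minimal sketch. The function name `write_results` and the dictionary layout (a `hosts` list plus an `award_data` mapping with `presenters`, `nominees`, and `winner` per award) are assumptions for this example and may not match what `gg_api.py` actually writes. The visualizations are not covered.

```python
import json


def write_results(results, json_path="results.json", md_path="results.md"):
    """Save one results dict as JSON and as a readable Markdown summary."""
    # Machine-readable copy.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

    # Human-readable copy built from the same dict.
    lines = ["Hosts: " + ", ".join(results["hosts"]), ""]
    for award, data in results["award_data"].items():
        lines.append("Award: " + award)
        lines.append("- Presenters: " + ", ".join(data["presenters"]))
        lines.append("- Nominees: " + ", ".join(data["nominees"]))
        lines.append("- Winner: " + data["winner"])
        lines.append("")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```

Building both outputs from the same dictionary keeps the two files from drifting apart.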
We achieved the following autograder scores on completeness and spelling:
| | Hosts | Awards | Winners | Presenters | Nominees |
|---|---|---|---|---|---|
| Spelling | 1.0 | 0.8017327400895528 | 0.5692918192918193 | 0.5 | 0.511293567251462 |
| Completeness | 1.0 | 0.14782608695652172 | 0.5 | 0.02523999905687549 |
- James Cameron (0.1048158640226629)
- Anne Hathaway (0.0708215297450425)
- Tina Fey Amy (0.06326723323890462)
- Taylor Swift (0.05193578847969783)
- Jennifer Lawrence (0.04343720491029273)
- Anne Hathaway (0.14887640449438203)
- Daniel Day (0.07584269662921349)
- Les Miserables (0.07303370786516854)
- Les Mis (0.05056179775280899)
- Hugh Jackman (0.03651685393258427)