Team members:
- Jason Qiu
- Jiayi Xu
- Mingyu Jin
This project mines and extracts information from tweets about motion picture award ceremonies.
The sample dataset in this repository consists of tweets from the 70th Golden Globe Awards in 2013.
The extracted information includes:
- Host(s) for the ceremony
- Award names
- Award presenters, given official award names
- Award nominees, given official award names
- Award winners, given official award names
We also extracted the following additional information:
- The #BestDressed celebrity, decided by a popular vote based on the number of mentions
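The popular vote above can be pictured as a simple tally. The following is a minimal sketch, not the project's actual implementation: it assumes each tweet is a plain string and that a list of candidate names is already known. The helper count_mentions, the candidate names, and the sample tweets are all hypothetical.

```python
from collections import Counter


def count_mentions(tweets, candidates):
    """Tally how many #BestDressed tweets mention each candidate name."""
    votes = Counter()
    for text in tweets:
        lowered = text.lower()
        # Assumption: only tweets carrying the hashtag count as votes.
        if "#bestdressed" not in lowered:
            continue
        for name in candidates:
            # Case-insensitive substring match on the candidate's name.
            if name.lower() in lowered:
                votes[name] += 1
    return votes


# Made-up tweets for illustration only; not taken from the dataset.
sample_tweets = [
    "Person A is my #BestDressed pick tonight",
    "#bestdressed has to be Person A",
    "Person B looked great #BestDressed",
    "Person B presenting an award right now",
]
votes = count_mentions(sample_tweets, ["Person A", "Person B"])
winner, mentions = votes.most_common(1)[0]
print(winner, mentions)
```

A real pipeline would likely need name recognition and fuzzy matching rather than a fixed candidate list, which is beyond this sketch.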
- Clone this repository.
$ git clone git@github.com:jasonqiu212/golden-globes-tweet-mining.git
$ cd golden-globes-tweet-mining
- Install the required Python packages from requirements.txt.
$ python3 -m pip install -r requirements.txt
- Download the spaCy package en_core_web_sm.
$ python3 -m spacy download en_core_web_sm
- If you need to change the names of the files containing the tweets and answers, change the corresponding constants under the main function in gg_api.py.
YEAR = 2013
TWEETS_FILE_NAME = 'gg{}.json'.format(YEAR)
ANSWER_FILE_NAME = 'gg{}answers.json'.format(YEAR)
To run the extraction on a smaller set of tweets, use the small representative dataset we created, gg2013-subset.json, by replacing TWEETS_FILE_NAME with 'gg2013-subset.json'.
- Extracting information from tweets takes a long time, so we included a time limit for running the extraction. If you wish to change the time limit, change the corresponding constant under the main function in gg_api.py. The current time limit is 15 minutes, or 900 seconds.
TIME_LIMIT = 900
- Extract the information from the tweets.
$ python3 gg_api.py
- A file called gg{YEAR}results.json containing the results in JSON format will be generated in the root directory. A file called gg{YEAR}results_humanreadable.txt containing the results in a human-readable format will also be generated.
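To inspect the generated JSON file from Python, something like the following may be enough. This is a sketch under assumptions: the helpers results_path and load_results are hypothetical, and the layout of the results is not described here, so the example only reports the top-level type rather than reading specific keys.

```python
import json
import os


def results_path(year, directory="."):
    """Build the results file name following the gg{YEAR}results.json pattern."""
    return os.path.join(directory, "gg{}results.json".format(year))


def load_results(year, directory="."):
    """Parse the results file for the given year and return the decoded JSON."""
    with open(results_path(year, directory), encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    if os.path.exists(results_path(2013)):
        results = load_results(2013)
        # The structure is not documented here, so only the type is shown.
        print(type(results).__name__)
    else:
        print("Run gg_api.py first to generate the results file.")
```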
Congratulations! You have successfully mined and extracted information from tweets about award ceremonies.
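For readers curious how a time limit like TIME_LIMIT could be enforced, here is one possible approach. It is only a sketch and not necessarily how gg_api.py does it: it assumes the extraction is split into independent steps that can be checked against a deadline between steps. The function run_with_deadline and the placeholder steps are hypothetical.

```python
import time

TIME_LIMIT = 900  # seconds, matching the constant described above


def run_with_deadline(tasks, time_limit=TIME_LIMIT):
    """Run callables in order, stopping once the time budget is spent."""
    deadline = time.monotonic() + time_limit
    results = []
    for task in tasks:
        # Check the clock before starting each step; a step that is
        # already running is not interrupted by this approach.
        if time.monotonic() >= deadline:
            break
        results.append(task())
    return results


# Placeholder steps standing in for the real extraction stages.
steps = [lambda: "hosts", lambda: "awards", lambda: "winners"]
print(run_with_deadline(steps, time_limit=60))
```

Because the check is cooperative, a single long-running step can still overshoot the limit; interrupting work mid-step would need a different mechanism such as a separate process.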
- Python libraries used: spaCy, ftfy, langdetect, unidecode, editdistance, cinemagoer, nltk