/Sports-Bot

Primary LanguageJupyter NotebookMIT LicenseMIT

Sports-Bot: Twitter Injury Report Detection

App: Sports-Bot-App
Website: Sports-Bot
Presentation: Presentation

This directory contains the twitter bot project submitted to Unicode Research, an online teaching and research organization which provides classes and competitions for students to engage in.

The goal of this project is to use a model that can classify injury reports in Baseball from Twitter data. The best model found during the 8-week course and competition was the pre-trained RoBERTa Neural Network. The long-term goal of this project is to collect injury data possible for all sports players, classify the type of injury and create a website where this data can be displayed.

Among all applicants, this project won 1st place.

Model Data Type / Epochs Sensitivity Specificity Precision F1 Score Accuracy
kNN Boolean 0.2734 0.9965 0.9182 0.4214 0.9058
kNN TF-IDF 0.1957 0.9942 0.8294 0.3167 0.8941
Bernoulli NB Boolean 0.8614 0.9486 0.7061 0.776 0.9377
Multinomial NB Count 0.8614 0.9342 0.6525 0.7425 0.9251
Logistic Regression TF-IDF 0.9373 0.9787 0.8629 0.8986 0.9735
Random Forest Boolean 0.7631 0.9719 0.7959 0.7792 0.9458
Random Forest TF-IDF 0.8502 0.9522 0.7184 0.7787 0.9394
SVM TF-IDF 0.8661 0.9909 0.9315 0.8976 0.9752
LSTM 10 0.9353 0.985 0.9541 0.9466 0.9726
GRU 10 0.9226 0.9864 0.9577 0.9398 0.9705
RoBERTa 5 0.9478 0.9887 0.9664 0.957 0.9782
XLM-RoBERTa 5 0.9648 0.9855 0.9568 0.9608 0.9803
XLNet 5 0.9691 0.9841 0.953 0.9609 0.9803
DistilBERT 5 0.8706 0.9812 0.9393 0.9036 0.9536
DistilBERT FT 5 0.8861 0.9789 0.9333 0.9091 0.9557

The best in each category is bolded, but for our purposed our most important metric is specificity.

The score most valued for our use case was the sensitivity, so we label the best "classical" machine learning model and best neural net model with bold text. All classical models were trained on the full dataset (15,000 datapoints) using stratified sampling, while all Neural Networks were completed on a curated sample of the dataset (7,000 datapoints) to deal with class imbalance issues.

This Folder represents the most up-to-date version of the project. For seeing the project as it looked a week after the Final Presentation, please see the Sports Injury Classification Repository.