The Skating Shift is a data mining project for CSCI 5502 at CU Boulder. Its purpose was to understand the large increase in rollerskating sales during the 2020 COVID-19 pandemic. The analysis was performed by scraping five years of Twitter data, from 2016 to 2020, for keywords such as "Rollerskating", "Rollerskates", and "Rollerblades".
The data was analyzed using historical trends, sentiment analysis, and keyword search.
Dev containers let users run their development environment inside a Docker container while using Visual Studio Code. A dev container is helpful because we don't have to worry about which versions of Python are installed on our machine. Instead, we can specify which version of everything runs inside the Docker container and focus on our code. This is made possible by the Remote Containers VS Code extension provided by Microsoft.
- Press F5 to launch the app in the container
- Press F1 and run the "Forward a Port" command
App.py contains the data scraping functionality. I used two libraries to scrape Tweets: Tweepy and Snscrape. Tweepy only let me retrieve 7 days of historical data from Twitter's search index, so I switched to Snscrape to get data from the past 5 years. Functions for both Tweepy and Snscrape are included in this Python file.
- Twitter Developer Account: needed to gain access to the Twitter API
- Tweepy: Python library for accessing the Twitter API
- Snscrape: Python library for scraping large amounts of historical tweets from Twitter
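The Snscrape approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in App.py: the helper names `build_query` and `scrape_tweets` and the row layout are hypothetical, and the sketch assumes Snscrape's `TwitterSearchScraper` together with Twitter's `since:` and `until:` search operators.

```python
from datetime import date
from itertools import islice


def build_query(keyword: str, since: date, until: date) -> str:
    """Combine a keyword with Twitter's since:/until: search operators."""
    return f"{keyword} since:{since.isoformat()} until:{until.isoformat()}"


def scrape_tweets(query: str, limit: int = 100) -> list:
    """Collect up to `limit` tweets matching `query` with Snscrape.

    Hypothetical helper; the row layout is an assumption for illustration.
    """
    # Imported lazily so build_query can be used without Snscrape installed.
    import snscrape.modules.twitter as sntwitter

    rows = []
    scraper = sntwitter.TwitterSearchScraper(query)
    for tweet in islice(scraper.get_items(), limit):
        # The text attribute is `content` in older Snscrape releases and
        # `rawContent` in newer ones, so try both.
        text = getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")
        rows.append({"id": tweet.id, "date": tweet.date, "text": text})
    return rows


# Build a query covering the project's 2016 to 2020 window.
query = build_query("Rollerskating", date(2016, 1, 1), date(2021, 1, 1))
print(query)
```

Calling `scrape_tweets(query, limit=500)` would then return a list of dictionaries that can be loaded into a Pandas DataFrame. Keeping a `limit` makes it easy to try a query on a small sample before running a multi-year scrape.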
I installed the following packages in my dev container to help with data analysis:
- Jupyter: Jupyter Notebooks for performing data analysis with brief snippets of Python code
- Matplotlib: used to create different visualizations based on my findings
- Pandas: Python package for data manipulation and analysis
- NumPy: Python package for operating on large multi-dimensional arrays and matrices
- TextBlob: helped me analyze the sentiment of my scraped tweets with polarity and subjectivity scoring
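As a rough sketch of the TextBlob scoring mentioned above, the snippet below reads the `polarity` and `subjectivity` values that TextBlob exposes on `TextBlob(text).sentiment`. The helper names `score_tweet` and `label_polarity`, and the 0.05 threshold used to bucket scores, are assumptions made for illustration and are not taken from the project's notebooks.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Bucket a polarity score in [-1, 1] as positive, negative, or neutral.

    The threshold is an illustrative assumption, not a project setting.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def score_tweet(text: str) -> dict:
    """Return TextBlob polarity and subjectivity for one tweet."""
    # Lazy import so label_polarity works without TextBlob installed.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return {
        "polarity": sentiment.polarity,          # -1.0 (negative) to 1.0 (positive)
        "subjectivity": sentiment.subjectivity,  # 0.0 (objective) to 1.0 (subjective)
        "label": label_polarity(sentiment.polarity),
    }
```

Applied to a DataFrame column of tweet text, `score_tweet` yields one row of scores per tweet, which can then be averaged by month or by keyword.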
I wanted to understand the following:
- Historical patterns of Rollerskating
- Possible correlations with keywords including "COVID"
- Comparison between wheeled sports such as "Inline" and "Quad" Skating
- Sentiment of Rollerskating during 2020 and with keywords such as "COVID"
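The first two questions above come down to counting tweets over time and checking how often a keyword such as "COVID" appears alongside skating terms. The sketch below shows one way to do that with the standard library only; the same could be done in Pandas. The function names and the row layout (`date`, `text`) are hypothetical, and the sample rows are made up for illustration and are not project data.

```python
from collections import Counter
from datetime import datetime


def tweets_per_year(tweets):
    """Count tweets by calendar year."""
    return Counter(t["date"].year for t in tweets)


def keyword_share(tweets, keyword):
    """Fraction of tweets whose text mentions `keyword` (case-insensitive)."""
    if not tweets:
        return 0.0
    hits = sum(keyword.lower() in t["text"].lower() for t in tweets)
    return hits / len(tweets)


# Made-up sample rows, for illustration only (not project data).
sample = [
    {"date": datetime(2019, 6, 1), "text": "Rollerskating at the rink"},
    {"date": datetime(2020, 4, 2), "text": "Picked up rollerskating during COVID lockdown"},
    {"date": datetime(2020, 5, 9), "text": "New rollerskates arrived"},
    {"date": datetime(2020, 7, 4), "text": "covid hobby: quad skating"},
]

print(tweets_per_year(sample))         # yearly tweet counts
print(keyword_share(sample, "COVID"))  # share of tweets mentioning COVID
```

Running `keyword_share` separately on "Inline" and "Quad" gives a simple starting point for comparing the two kinds of skating.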
Please see my Jupyter notebooks for in-depth analysis.