Detail each research question you intend to answer.
- What time of the year do students experience the most stress/anxiety?
- Do exams/deadlines affect students’ mental health?
- Do holidays/spring break improve students’ mental health?
- Kaggle Mental Health datasets
- Scraping relevant subreddits on Reddit.
- Social media in general: Twitter, Reddit, Instagram, etc.
- Data Collection: Retrieving data from Kaggle, scraping Reddit, using the Reddit API, and scraping other social media
- Data Exploration: We will perform exploratory data analysis (EDA) to identify outliers and anomalies in the data.
- Data Cleaning: Data cleaning is necessary. Scraped data contains noise that needs to be removed.
- Data Integration: Data integration is necessary because we will be using scraped data and pre-made datasets.
- Data Analysis: We intend to use machine learning and NLP to analyze our data. We plan to report confidence intervals when evaluating our analysis results.
- Data Product: Interactive Visualization
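One way the analysis step could answer the first research question is sketched below. It assumes each post has already been given a numeric stress score by an upstream NLP model, which this proposal does not specify. The function name `monthly_stress_summary`, the 0-to-1 score scale, and the sample values are hypothetical and used only for illustration. The interval is a simple normal approximation, not necessarily the method we will adopt.

```python
from collections import defaultdict
from datetime import date
from math import sqrt
from statistics import mean, stdev


def monthly_stress_summary(posts, z=1.96):
    """Group (date, score) pairs by calendar month.

    Returns {month: (mean, ci_low, ci_high)} using a normal-approximation
    interval (z=1.96 corresponds to roughly 95% coverage).
    """
    by_month = defaultdict(list)
    for posted_on, score in posts:
        by_month[posted_on.month].append(score)

    summary = {}
    for month, scores in sorted(by_month.items()):
        avg = mean(scores)
        # The sample standard deviation needs at least two observations.
        if len(scores) > 1:
            half_width = z * stdev(scores) / sqrt(len(scores))
        else:
            half_width = float("nan")
        summary[month] = (avg, avg - half_width, avg + half_width)
    return summary


# Hypothetical, hand-made scores (assumed scale: 0 = calm, 1 = very stressed).
sample_posts = [
    (date(2023, 3, 20), 0.2), (date(2023, 3, 22), 0.4),
    (date(2023, 12, 10), 0.8), (date(2023, 12, 12), 0.6),
]
summary = monthly_stress_summary(sample_posts)

# Month with the highest average stress score in the sample.
peak_month = max(summary, key=lambda m: summary[m][0])
```

With real data, months whose intervals do not overlap would be the strongest candidates for genuine peaks or periods of relief.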
The completion of this project is anticipated to have a significant impact on understanding the dynamics of student mental health throughout the academic year, specifically with respect to stress, anxiety, and the effects of academic pressures versus breaks. The greatest impact of this study could be a comprehensive understanding of when students are most vulnerable to stress and anxiety, along with the identification of potential periods of relief. We can then use these findings to provide better mental health support to students.
- We might encounter issues related to scraping. Social media platforms like Twitter have implemented strict limits on scraping, which we may need to work around. If necessary, we can pay a small fee to use a platform’s official API to gather data.
- Data quality might be a concern. Posts on subreddits like r/anxiety may not always be related to stress, and there is also the risk of encountering off-topic posts or spam. To combat this, we might need to manually label data or use other metrics, such as upvote count, to determine whether a post is relevant.
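The relevance check described in the last point could start as a simple filter like the sketch below. The upvote threshold comes from the idea above, while the keyword list, the `text` and `upvotes` field names, and the cutoff of 5 are assumptions made for illustration and would need tuning against manually labeled posts.

```python
# Hypothetical keyword list; a real one would be refined from labeled data.
STRESS_KEYWORDS = ("stress", "anxiety", "anxious", "exam", "deadline", "overwhelmed")


def is_relevant(post, min_upvotes=5, keywords=STRESS_KEYWORDS):
    """Keep a post only if it has enough upvotes and mentions a stress keyword."""
    text = post["text"].lower()
    return post["upvotes"] >= min_upvotes and any(word in text for word in keywords)


# Hand-made example posts, not real scraped data.
example_posts = [
    {"text": "Finals week has me so stressed I can't sleep", "upvotes": 42},
    {"text": "Check out my new crypto channel!!!", "upvotes": 0},
    {"text": "Anxious about my thesis deadline", "upvotes": 1},
]
relevant_posts = [p for p in example_posts if is_relevant(p)]
```

A filter this strict also drops genuine posts with few upvotes, such as the third example, so the threshold trades recall for precision.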