This is a semester-long project for my Probability & Statistics for Computer Science course.
It uses descriptive statistics and data visualizations to analyse the on-time performance of New Jersey Transit commuter rail between 2018 and 2020.
This was my first project in Python, so I was learning as I went. There is some janky, inefficient code in here that I would have liked to clean up. I think this is something I'll come back to after final grades are in.
This project was a fun way to learn both Python and probability & statistics. If I had had more time (or managed my time better), I would have liked to combine this data set with weather data, and then possibly geographic or GTFS data. I did start to explore probabilities a few commits back, but that section was unfinished (and possibly incorrect), so it was omitted from the final submission. It would have been another interesting thing to explore.
I chose this data set because I've always been interested in transportation and regional planning. Data like this could be useful for predicting future delays, informing scheduling, and combining with other datasets (e.g. weather data, as the dataset's author suggests). The dataset's author scraped the data from NJT's DepartureVision real-time status service. I have my own project using SEPTA's real-time data, and this might inspire me to create something similar.
Disclaimer: This analysis was a learning exercise made for educational purposes, not for any kind of institutional research. Some of the analysis may be inaccurate or faulty.
The on-time performance data used for this project can be found here:
https://www.kaggle.com/datasets/pranavbadami/nj-transit-amtrak-nec-performance?resource=download
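
For anyone curious what "descriptive statistics" means here, this is a minimal sketch of the kind of summary the project is after. It is not the code from the project itself. The delay values are made up for illustration and do not come from the dataset, and the 6-minute on-time threshold and the `summarize_delays` helper are assumptions of mine, not something defined by the data.

```python
import statistics

# Made-up delay values in minutes, for illustration only (NOT from the dataset).
sample_delays = [0.0, 1.5, 2.0, 3.0, 4.5, 6.0, 7.5, 12.0]

# Assumed cutoff: a train counts as "on time" if it is under 6 minutes late.
ON_TIME_THRESHOLD_MINUTES = 6.0


def summarize_delays(delays, threshold=ON_TIME_THRESHOLD_MINUTES):
    """Return basic descriptive statistics for a list of delays in minutes."""
    on_time = [d for d in delays if d < threshold]
    return {
        "count": len(delays),
        "mean": statistics.mean(delays),
        "median": statistics.median(delays),
        "stdev": statistics.stdev(delays),  # sample standard deviation
        "on_time_rate": len(on_time) / len(delays),
    }


summary = summarize_delays(sample_delays)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With the real data, the list of delays would instead be read from the downloaded CSV, and the same summary could be computed per line or per month.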