This project provides a comprehensive Python pipeline to process and analyze MTA bus ridership data. The pipeline includes data loading, cleaning, transformation, and storage, along with an integration of Node.js for additional tasks.
This is a fork of James Pizzurro's MTA Bus Ridership Scraper that essentially acts as a python wrapper to run the scraper and process the data.
The scraper is a Node.js script that uses Puppeteer to automate the process of downloading the MTA bus ridership data. The scraper is run as part of the data pipeline, and the data is processed and saved to a CSV file.
This fork adds steps to clean the data, standardize the names of the bus routes, convert dates to a standard format, and calculate the average daily ridership for each route by month.
The raw data is found at www.mta.maryland.gov/performance-improvement
After scaping the data, the raw data is stored in the data/raw
directory. The processed data is stored in the data/processed
directory.
data/raw/mta_bus_ridership.csv
is not committed to the repo, but looks like this:
Date | Route | Ridership |
---|---|---|
01/2023 | 103 | 3916 |
01/2023 | 105 | 3530 |
01/2023 | 115 | 4179 |
01/2023 | 120 | 3887 |
01/2023 | 150 | 1833 |
The processed data is stored in data/processed/mta_bus_ridership.csv
and looks like this:
route | date | date_end | ridership | num_days_in_month | ridership_per_day |
---|---|---|---|---|---|
103 | 2018-01-01 | 2018-01-31 | 8369.0 | 31 | 269.96774193548384 |
103 | 2018-02-01 | 2018-02-28 | 9338.0 | 28 | 333.5 |
103 | 2018-03-01 | 2018-03-31 | 8372.0 | 31 | 270.06451612903226 |
103 | 2018-04-01 | 2018-04-30 | 8970.0 | 30 | 299.0 |
103 | 2018-05-01 | 2018-05-31 | 9570.0 | 31 | 308.7096774193548 |
To set up the project, run the setup.sh
script. This script will check for Conda and Node.js installations, create a Conda environment, install Python dependencies, and set up the necessary directories.
./setup.sh
To run the pipeline, execute main.py. This script orchestrates the data transformation process, including data loading, processing, and saving the processed data to a CSV file.
python main.py
main.py
- Data Cleaning: Standardizes column names in the data.
- Data Processing: Includes functions for route processing and date-related calculations.
- Data Transformation: Applies data processing tasks to the dataset.
- Data Saving: Outputs the processed data to a CSV file.
- Node.js Integration: Runs a Node.js script as part of the data pipeline.
setup.sh
- Sets up the Python environment.
- Installs Python and Node.js dependencies.
- Prepares the project directory structure.
- Python 3.11
- Pandas
- Prefect (For running the pipeline)
- Node.js (To run the browser automation script)
project/
│
├── main.py - Main script for running the data pipeline.
├── setup.sh - Script for setting up the environment and dependencies.
├── data/
│ ├── raw/ - Directory for storing raw data files.
│ └── processed/ - Directory for storing processed data.
└── node/
└── index.js - Node.js script for scraping the MTA bus ridership data.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
GNU General Public License v3.0
James Pizzurro - MTA Bus Ridership Scraper