This pipeline scrapes a URL for the most recent National Transit Database Excel workbook and transforms it into a local SQLite database.
- os
- re
- sqlite3
- yaml
- tempfile
- pandas
- requests
- bs4 (BeautifulSoup)
- prefect
- sqlalchemy
- janitor
- tqdm
- math
- Configurations are loaded from a YAML file, `conf/main.yaml`.
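
As a minimal sketch, the configuration might be loaded at startup with `yaml.safe_load`; the key names shown below (`url`, `sheets`, `db_path`, `output_dir`) are illustrative assumptions, not the file's actual schema:

```python
import yaml

# Load pipeline settings from conf/main.yaml.
# The key names below are assumptions for illustration; the real schema
# is whatever conf/main.yaml defines for this project.
with open("conf/main.yaml") as f:
    config = yaml.safe_load(f)

source_url = config["url"]          # page to scrape for the ridership workbook
sheets = config["sheets"]           # Excel sheet names to read
db_path = config["db_path"]         # target SQLite database file
output_dir = config["output_dir"]   # directory for downloads and intermediates
```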
`scrape_monthly_ridership_url(url)`: Scrapes the given URL to find and return the absolute URL, file name, and month-year of the ridership Excel file.
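
A rough sketch of how this scrape could work with `requests` and BeautifulSoup; the `.xlsx` link pattern and the month-year extraction are assumptions about the page layout and file naming, not the script's exact logic:

```python
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape_monthly_ridership_url(url: str) -> tuple[str, str, str]:
    """Return (absolute file URL, file name, month-year) for the ridership workbook."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Assume the ridership workbook is linked as an .xlsx file on the page.
    link = soup.find("a", href=re.compile(r"\.xlsx$", re.IGNORECASE))
    if link is None:
        raise ValueError("No .xlsx link found on the page")

    file_url = urljoin(url, link["href"])
    file_name = file_url.rsplit("/", 1)[-1]

    # Assume the month and year appear in the link text or file name,
    # e.g. "Monthly Module Adjusted Data Release - March 2024".
    match = re.search(r"([A-Za-z]+ \d{4})", link.get_text() + " " + file_name)
    month_year = match.group(1) if match else ""
    return file_url, file_name, month_year
```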
`download_excel_workbook(file_url, output_dir)`: Downloads the Excel workbook from the provided URL to the specified output directory.
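
A possible implementation using streamed `requests` with a `tqdm` progress bar (both are listed dependencies); the chunk size and the returned local path are assumptions:

```python
import os

import requests
from tqdm import tqdm


def download_excel_workbook(file_url: str, output_dir: str) -> str:
    """Stream the workbook to output_dir and return the local file path."""
    os.makedirs(output_dir, exist_ok=True)
    file_path = os.path.join(output_dir, file_url.rsplit("/", 1)[-1])

    with requests.get(file_url, stream=True, timeout=60) as response:
        response.raise_for_status()
        total = int(response.headers.get("content-length", 0))
        with open(file_path, "wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                bar.update(len(chunk))
    return file_path
```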
`read_excel_workbook(file_path, sheets)`: Reads the specified sheets from an Excel workbook.

`read_sheet_from_excel(xl, sheet_name, sheet_index)`: Reads a specific sheet from the Excel workbook.

`transform_data(df, value_name)`: Transforms the data in the dataframe according to the configurations specified.
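
The reading and reshaping steps might look roughly like the sketch below, with the wide month columns melted into a long format (reading `.xlsx` files with pandas also requires an Excel engine such as openpyxl). The identifier columns named here are assumptions; the real ones come from `conf/main.yaml`:

```python
import pandas as pd


def read_excel_workbook(file_path: str, sheets: list[str]) -> dict[str, pd.DataFrame]:
    """Read the requested sheets from the workbook into a dict of dataframes."""
    xl = pd.ExcelFile(file_path)
    return {name: xl.parse(name) for name in sheets}


def read_sheet_from_excel(xl: pd.ExcelFile, sheet_name: str, sheet_index: int) -> pd.DataFrame:
    """Parse one sheet by name, falling back to its positional index."""
    if sheet_name in xl.sheet_names:
        return xl.parse(sheet_name)
    return xl.parse(xl.sheet_names[sheet_index])


def transform_data(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Melt wide month columns into one long (id, month, value) table."""
    id_columns = ["NTD ID", "Agency", "Mode"]  # assumed identifier columns
    month_columns = [c for c in df.columns if c not in id_columns]
    return df.melt(
        id_vars=id_columns,
        value_vars=month_columns,
        var_name="month",
        value_name=value_name,
    )
```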
`merge_transformed_data(dfs)`: Merges a list of dataframes based on specified ID columns.

`save_data_to_intermediate_file(df, sheet_name, output_dir)`: Saves the provided dataframe to an intermediate Parquet file.
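
One plausible way to merge the long tables and stage them as Parquet (writing Parquet requires a pandas Parquet engine such as pyarrow); the join keys below are assumptions:

```python
import os
from functools import reduce

import pandas as pd


def merge_transformed_data(dfs: list[pd.DataFrame]) -> pd.DataFrame:
    """Merge long-format dataframes on shared identifier columns (assumed names)."""
    id_columns = ["NTD ID", "Agency", "Mode", "month"]  # assumed join keys
    return reduce(lambda left, right: left.merge(right, on=id_columns, how="outer"), dfs)


def save_data_to_intermediate_file(df: pd.DataFrame, sheet_name: str, output_dir: str) -> str:
    """Write the dataframe to an intermediate Parquet file and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"{sheet_name}.parquet")
    df.to_parquet(path, index=False)
    return path
```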
- Class `AgencyModeMonth`: Represents a table in a database for storing monthly ridership data.

`save_data_to_database(df, db_path)`: Saves the dataframe to a SQLite database.

`transform_to_star_schema(db_path)`: Transforms the data into a star schema for database storage.
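
A minimal sketch of how the ORM model and the SQLite load could be wired with SQLAlchemy; the column names and types are assumptions about the table's actual shape, and the star-schema step is omitted:

```python
import pandas as pd
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class AgencyModeMonth(Base):
    """Assumed shape of the monthly ridership table; real columns may differ."""

    __tablename__ = "agency_mode_month"

    id = Column(Integer, primary_key=True, autoincrement=True)
    ntd_id = Column(String)
    agency = Column(String)
    mode = Column(String)
    month = Column(String)
    ridership = Column(Float)


def save_data_to_database(df: pd.DataFrame, db_path: str) -> None:
    """Create the table if needed and append the dataframe to SQLite.

    Assumes df's column names already match the table's columns.
    """
    engine = create_engine(f"sqlite:///{db_path}")
    Base.metadata.create_all(engine)
    df.to_sql(AgencyModeMonth.__tablename__, engine, if_exists="append", index=False)
```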
`scrape_download_flow(url)`: A Prefect flow for scraping and downloading data.

`transform_merge_upload_flow(file_path)`: A Prefect flow for transforming, merging, and uploading data.
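
Assuming Prefect 2.x, the first flow might wrap the scrape and download steps as tasks roughly like this; the `output_dir` parameter and return value are assumptions, and the task bodies reuse the sketches above:

```python
from prefect import flow, task


@task
def scrape_task(url: str) -> tuple[str, str, str]:
    # Reuses the scrape sketch above: (file_url, file_name, month_year).
    return scrape_monthly_ridership_url(url)


@task
def download_task(file_url: str, output_dir: str) -> str:
    # Reuses the download sketch above; returns the local file path.
    return download_excel_workbook(file_url, output_dir)


@flow
def scrape_download_flow(url: str, output_dir: str = "data") -> str:
    file_url, _file_name, _month_year = scrape_task(url)
    return download_task(file_url, output_dir)
```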
- Ensure all required libraries are installed.
- Configure the `conf/main.yaml` file as per requirements.
- Run the script to perform the data scraping, transformation, and storage operations (see the sketch below).
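
A hypothetical entry point tying the two flows together; it assumes the flows and config loading shown above live in the same script and that the config exposes a `url` key:

```python
import yaml

if __name__ == "__main__":
    with open("conf/main.yaml") as f:
        config = yaml.safe_load(f)

    # Flow names match the README; their exact parameters and return values
    # are assumptions carried over from the sketches above.
    file_path = scrape_download_flow(config["url"])
    transform_merge_upload_flow(file_path)
```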
- This script requires a proper configuration file and external data sources to function correctly.
- It is designed to handle specific data formats and may need adjustments for different data structures.