cal-itp/data-analyses

Research Request - download NTD 2022 data

Closed this issue · 2 comments

Complete the below when receiving a research request, and continue to add to this issue as you receive additional details and produce deliverables. Be sure to also add the appropriate project-level label to this issue (eg gtfs-rt, DLA).

Research Question

Single sentence description: Download the latest year...2022 NTD data

Detailed description:

  • warehouse raw source table
  • warehouse mart table
  • Notebook for exploring whether ntd_id changed quite a bit from 2021 data
  • Slack thread

Deliverables

  • Identified relevant scripts in data-infra with DAG that calls this scrape_ntd script
  • Notebook for ntd_id comparison
  • Depending on results of ntd_id...more work to reconcile ids, or we can append to our mart table, and also update dim_organizations
  • GCS bucket: calitp-ntd-data-products and test-calitp-ntd-data-products...what's the difference?
  • Added this _env file and loaded the bucket name in for CALITP_BUCKET__NTD_DATA_PRODUCTS:
  • added import dotenv and dotenv.load_dotenv("_env")
  • in terminal: python scrape_ntd.py annual-database-agency-information 2022 https://www.transit.dot.gov/sites/fta.dot.gov/files/2023-10/2022%20Agency%20Information_2.xlsx
  • make sure to grab the Excel url (like above), not just the website url
  • Debugging errors that come up Slack thread
  • dbt seed file is a csv crosswalk -- we do need one
  • there are some that need to be manually reconciled via analyst helping create this crosswalk
    • notebook shows more stringent merges on id and name will result in many not matching
    • but, if we loosen the merge columns and parse ntd_id into a prefix/suffix portion and merge on the parsed...a lot of these will merge between 2021 and 2022
    • create new column called ntd_id_2022
    • create reporter table, which is ntd_id-reported_by_name-year...which will sit between the raw export with what we want to show in mart_ntd
    • reduce mart_ntd to 1 row per ntd_id-year
  • further exploration: if in 2022, there are 2 separate reported_by_name rows, in the past, was it just 1? why do multiple appear now? On NTD site, there is a new definition for ntd_id, where the one submitting does have a prefix attached
  • Assign to analyst: seed file: 2021 ntd_id to2022 ntd_id...can get at it in notebook, but ultimately distill down to 2 columns to use in seed file
  • reporter matching we can get via joins after that