Research Request - download NTD 2022 data

Question

Research Request - download NTD 2022 data

Closed this issue 5 months ago · 2 comments

Complete the below when receiving a research request, and continue to add to this issue as you receive additional details and produce deliverables. Be sure to also add the appropriate project-level label to this issue (eg gtfs-rt, DLA).

Research Question

Single sentence description: Download the latest year...2022 NTD data

Detailed description:

warehouse raw source table
warehouse mart table
Notebook for exploring whether ntd_id changed quite a bit from 2021 data
Slack thread

Deliverables

Identified relevant scripts in data-infra with DAG that calls this scrape_ntd script
Notebook for ntd_id comparison
Depending on results of ntd_id...more work to reconcile ids, or we can append to our mart table, and also update dim_organizations
GCS bucket: calitp-ntd-data-products and test-calitp-ntd-data-products...what's the difference?

Answer 1 · 2023-11-15T22:32:37.000Z

Added this _env file and loaded the bucket name in for CALITP_BUCKET__NTD_DATA_PRODUCTS:
added import dotenv and dotenv.load_dotenv("_env")
in terminal: python scrape_ntd.py annual-database-agency-information 2022 https://www.transit.dot.gov/sites/fta.dot.gov/files/2023-10/2022%20Agency%20Information_2.xlsx
make sure to grab the Excel url (like above), not just the website url

Answer 2 · 2023-12-05T15:52:17.000Z

Debugging errors that come up Slack thread
dbt seed file is a csv crosswalk -- we do need one
there are some that need to be manually reconciled via analyst helping create this crosswalk
- notebook shows more stringent merges on id and name will result in many not matching
- but, if we loosen the merge columns and parse ntd_id into a prefix/suffix portion and merge on the parsed...a lot of these will merge between 2021 and 2022
- create new column called ntd_id_2022
- create reporter table, which is ntd_id-reported_by_name-year...which will sit between the raw export with what we want to show in mart_ntd
- reduce mart_ntd to 1 row per ntd_id-year
further exploration: if in 2022, there are 2 separate reported_by_name rows, in the past, was it just 1? why do multiple appear now? On NTD site, there is a new definition for ntd_id, where the one submitting does have a prefix attached
Assign to analyst: seed file: 2021 ntd_id to2022 ntd_id...can get at it in notebook, but ultimately distill down to 2 columns to use in seed file
reporter matching we can get via joins after that