Research Request - download NTD 2022 data
Closed this issue · 2 comments
tiffanychu90 commented
Complete the below when receiving a research request, and continue to add to this issue as you receive additional details and produce deliverables. Be sure to also add the appropriate project-level label to this issue (eg gtfs-rt, DLA).
Research Question
Single sentence description: Download the latest year...2022 NTD data
Detailed description:
- warehouse raw source table
- warehouse mart table
- Notebook for exploring whether
ntd_id
changed quite a bit from 2021 data - Slack thread
Deliverables
- Identified relevant scripts in
data-infra
with DAG that calls this scrape_ntd script - Notebook for
ntd_id
comparison - Depending on results of
ntd_id
...more work to reconcile ids, or we can append to our mart table, and also updatedim_organizations
- GCS bucket:
calitp-ntd-data-products
andtest-calitp-ntd-data-products
...what's the difference?
tiffanychu90 commented
- Added this
_env
file and loaded the bucket name in forCALITP_BUCKET__NTD_DATA_PRODUCTS
: - added
import dotenv
anddotenv.load_dotenv("_env")
- in terminal:
python scrape_ntd.py annual-database-agency-information 2022 https://www.transit.dot.gov/sites/fta.dot.gov/files/2023-10/2022%20Agency%20Information_2.xlsx
- make sure to grab the Excel url (like above), not just the website url
tiffanychu90 commented
- Debugging errors that come up Slack thread
- dbt seed file is a csv crosswalk -- we do need one
- there are some that need to be manually reconciled via analyst helping create this crosswalk
- notebook shows more stringent merges on id and name will result in many not matching
- but, if we loosen the merge columns and parse
ntd_id
into a prefix/suffix portion and merge on the parsed...a lot of these will merge between 2021 and 2022 - create new column called
ntd_id_2022
- create reporter table, which is
ntd_id-reported_by_name-year
...which will sit between the raw export with what we want to show inmart_ntd
- reduce
mart_ntd
to 1 row perntd_id-year
- further exploration: if in 2022, there are 2 separate
reported_by_name
rows, in the past, was it just 1? why do multiple appear now? On NTD site, there is a new definition forntd_id
, where the one submitting does have a prefix attached - Assign to analyst: seed file:
2021 ntd_id
to2022 ntd_id
...can get at it in notebook, but ultimately distill down to 2 columns to use in seed file - reporter matching we can get via joins after that