NYC dataset changed format and S3 url
NYC.gov has changed all their files to Parquet. The CSV files are no longer available through the provided S3 links.
The new link is https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet
But it requires some additional processing to follow along. This mostly applies to video DE Zoomcamp 1.2.2 - Ingesting NY Taxi Data to Postgres, but it may come up in other places throughout the course.
First, install pyarrow:
pip install pyarrow
Then convert the parquet to pandas:
import pyarrow.parquet as pq
trips = pq.read_table('yellow_tripdata_2021-01.parquet')
df = trips.to_pandas()
Finally, run this command and wait. It will take a while, then return a number when it is finished.
df.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', chunksize=100000)
Alternatively, the .csv files could be added to the repo with links to those instead.
Changed again, now the link is: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page#:~:text=January-,Yellow%20Taxi%20Trip%20Records,-(PARQUET)