dataset parameters to reproduce paper results
I would like to reproduce the results in the paper, but I am having trouble creating a dataset with 6424 areas ("Number of areas" = 6424).
I gather the data like so (the TLC files are named `{color}_tripdata_{year}-{month}.csv`):

```python
cmd_list = []
for year in [2013, 2014, 2015, 2016]:
    for month in range(1, 13):
        for color in ['yellow', 'green']:
            cmd = ['wget',
                   f'https://s3.amazonaws.com/nyc-tlc/trip+data/{color}_tripdata_{year}-{month:02}.csv']
            print(cmd)
            cmd_list.append(cmd)
```
I run the commands in cmd_list, remove data after June 2016, and organize the files into directories by year.
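For reference, this is roughly how I execute the commands (a minimal sketch; `run_downloads` is a hypothetical helper, not part of the repo):

```python
import subprocess

def run_downloads(cmd_list):
    """Run each wget command in sequence; check=True stops on the
    first failed download. Returns the CompletedProcess objects."""
    results = []
    for cmd in cmd_list:
        results.append(subprocess.run(cmd, check=True))
    return results
```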
I run process_rawdata.py on each year directory with line 21 uncommented and line 22 commented, so that this is executed:

```python
df_each = (pd.read_csv(f, index_col=False, header=0, usecols=[1, 5, 6],
                       parse_dates=[0], infer_datetime_format=True,
                       names=['Pickup_datetime', 'Pickup_longitude', 'Pickup_latitude'],
                       memory_map=True)
           for f in all_files)
```
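I then combine the per-file generator into one frame per year, roughly like this (a toy sketch with in-memory stand-ins for the CSVs, not the actual script):

```python
import io
import pandas as pd

# Toy stand-ins for the yearly CSV files; the real files have many more columns.
all_files = [
    io.StringIO("2013-01-01 00:00:00,-73.99,40.75\n"),
    io.StringIO("2013-01-02 00:00:00,-73.98,40.76\n"),
]
df_each = (pd.read_csv(f, header=None,
                       names=['Pickup_datetime', 'Pickup_longitude', 'Pickup_latitude'],
                       parse_dates=['Pickup_datetime'])
           for f in all_files)
# Concatenate all files for the year into a single DataFrame
df_year = pd.concat(df_each, ignore_index=True)
print(len(df_year))  # 2
```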
After the .csv files are created for each year, I run make_dataset.py accordingly and concatenate the yearly geohash arrays, but the output of

```python
sorted_unique_geohash = sorted_unique_geohash[num_pickups > 200]
sorted_unique_geohash.sort()
print "after dropping small pickups (number of features): " + str(sorted_unique_geohash.shape)
```

is never 6424; it is always larger.
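To make the concatenation step concrete, this is a minimal sketch of what I do (hypothetical array names and toy data; the real script uses the `num_pickups > 200` threshold):

```python
import numpy as np

# Hypothetical per-year geohash arrays, one entry per pickup
yearly_geohashes = [
    np.array(['dr5ru', 'dr5ru', 'dr5rv']),  # e.g. 2013
    np.array(['dr5ru', 'dr5rw']),           # e.g. 2014
]
# Pool all years, then count pickups per unique geohash cell
all_geohashes = np.concatenate(yearly_geohashes)
sorted_unique_geohash, num_pickups = np.unique(all_geohashes, return_counts=True)
# Toy threshold of 2 here; the real script drops cells with <= 200 pickups
kept = sorted_unique_geohash[num_pickups > 2]
print(kept)  # ['dr5ru']
```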
I tried using only yellow cabs and still got a larger count.
Any ideas?