junxu1226/rnn-lstm-nyc-taxi

dataset parameters to reproduce paper results

Opened this issue · 0 comments

I would like to reproduce the results in the paper, but I am having trouble creating a dataset with 6424 areas (the number reported in the paper).

I gather the data like so (note the underscores in `_tripdata_`, which GitHub's markdown had stripped as italics, and the `cmd_list` initialization):

```python
cmd_list = []
for year in [2013, 2014, 2015, 2016]:
    for month in range(1, 13):
        for color in ['yellow', 'green']:
            cmd = ['wget',
                   'https://s3.amazonaws.com/nyc-tlc/trip+data/'
                   + color + '_tripdata_' + str(year) + '-' + f"{month:02}" + '.csv']
            print(cmd)
            cmd_list.append(cmd)
```

I run the cmd_list ...
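For reference, this is roughly how I execute the list (a sketch; `run_downloads` and the `dry_run` flag are my own names, not from the repo):

```python
import subprocess

def run_downloads(cmd_list, dry_run=True):
    """Run each ['wget', url] command; with dry_run=True just report what would run."""
    results = []
    for cmd in cmd_list:
        if dry_run:
            results.append(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on a failed download
    return results

# Example: echo one month's command rather than actually downloading it.
cmds = [["wget", "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2013-01.csv"]]
print(run_downloads(cmds))
```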

I remove data after June 2016 and organize the files into directories by year.
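The June-2016 cutoff I apply looks like this (a sketch; `keep_month` and the filename regex are my own helper, inferred from the TLC file naming, not repo code):

```python
import re

CUTOFF = (2016, 6)  # keep data through June 2016 inclusive

def keep_month(filename):
    """Return True if the file's YYYY-MM falls on or before the cutoff."""
    m = re.search(r"(\d{4})-(\d{2})\.csv$", filename)
    if not m:
        return False
    year, month = int(m.group(1)), int(m.group(2))
    return (year, month) <= CUTOFF

print(keep_month("yellow_tripdata_2016-06.csv"))  # True
print(keep_month("green_tripdata_2016-07.csv"))   # False
```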

I run process_rawdata.py on each year directory with line 21 uncommented and line 22 commented, such that this is run:

```python
df_each = (
    pd.read_csv(f, index_col=False, header=0, usecols=[1, 5, 6],
                parse_dates=[0], infer_datetime_format=True,
                names=['Pickup_datetime', 'Pickup_longitude', 'Pickup_latitude'],
                memory_map=True)
    for f in all_files
)
```
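As a minimal sanity check (my own toy data, not the TLC files), the generator expression concatenates cleanly into one frame with `pd.concat`:

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for the per-month CSVs.
files = [
    io.StringIO("dt,lon,lat\n2013-01-01 00:00:00,-73.99,40.75\n"),
    io.StringIO("dt,lon,lat\n2014-01-01 00:00:00,-73.98,40.76\n"),
]
df_each = (
    pd.read_csv(f, header=0, parse_dates=[0],
                names=['Pickup_datetime', 'Pickup_longitude', 'Pickup_latitude'])
    for f in files
)
df = pd.concat(df_each, ignore_index=True)
print(len(df))
```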

After the .csv files are created for each year, I run make_dataset.py accordingly and concatenate the yearly geohash arrays, but my

```python
sorted_unique_geohash = sorted_unique_geohash[num_pickups > 200]
sorted_unique_geohash.sort()
print "after dropping small pickups (number of features): " + str(sorted_unique_geohash.shape)
```

is never 6424; it is always larger.
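One thing I am checking is whether the >200 threshold should be applied to the globally concatenated counts rather than per year. A sketch of the global version with numpy (toy geohashes and counts of my own; the real arrays come from make_dataset.py):

```python
import numpy as np

# Toy pickup log: each entry is the geohash cell of one pickup.
pickups = np.array(['dr5ru'] * 250 + ['dr5rv'] * 120 + ['dr5rw'] * 30)

# Global counting: count over the full concatenated array, then threshold.
sorted_unique_geohash, num_pickups = np.unique(pickups, return_counts=True)
kept = sorted_unique_geohash[num_pickups > 200]
print(kept)  # only 'dr5ru' survives the >200 cut

# If instead each year is thresholded separately and the survivors are
# concatenated, a cell can pass in one busy year but fail globally (or
# vice versa), which changes the final feature count.
```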

I tried using only yellow cabs and still got a count larger than 6424.

Any ideas?