Updated on 2017-04-29: New data (demand.h5, holiday.txt, ...) uploaded to onedrive
Updated on 2017-04-12: Weather data (Meteorology.h5) uploaded to onedrive
The draft is constantly being updated on onedrive.
Updated on 2017-03-21: HDF5 and Holiday data uploaded to onedrive
- Weather: https://www.ncdc.noaa.gov/qclcd/QCLCD. A batch order is available through https://www.ncdc.noaa.gov/cdo-web/datasets
- NYC Taxi: http://www1.nyc.gov/html/tlc/html/about/trip_record_data.shtml
The generated data is derived from two years of raw yellow taxi trip records (2014-07-01 to 2016-06-30). For now, only 6 months of raw data (10 GB in total) has been used. The data generation process, designed and implemented as a MapReduce workflow, takes 2.5 hours to process two years of data. The same code could be run on a cluster (this requires contacting Columbia HPC).
Demand.mat: the generated data stored as a MATLAB binary file (MAT-file). It contains two variables: a timetable 'Demand' and a georeference object 'R'.
R: a georeference object that gives geographic information such as the latitude and longitude range
R=
Latitude Limits: [40.6769, 40.8868]
Longitude Limits: [-74.0411, -73.9073]
Raster Size: [32, 32]
Raster Interpretation: Rectangular Cells
Columns StartFrom: 'north'
Rows StartFrom: 'west'
Cell Edge Length In Latitude: 0.00723660362598688
Cell Edge Length In Longitude: 0.00955245263273959
Coordinate System Type: 'WGS84'
Unit: 'degree'
...
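The limits and raster size in R are enough to map a point to a grid cell. The following is a minimal Python sketch, not the authors' code: the function name latlon_to_cell is hypothetical, the cell extents are derived from the limits and raster size, and it assumes that row 0 is the northernmost band and column 0 the westernmost (one reading of the 'north' and 'west' settings above).

```python
# Values copied from the listing of R.
LAT_MIN, LAT_MAX = 40.6769, 40.8868
LON_MIN, LON_MAX = -74.0411, -73.9073
N_ROWS, N_COLS = 32, 32


def latlon_to_cell(lat, lon):
    """Return the zero-based (row, col) of the cell containing a point.

    Returns None when the point lies outside the georeferenced range.
    """
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        return None
    # Cell extents derived from the limits and raster size (an assumption).
    cell_lat = (LAT_MAX - LAT_MIN) / N_ROWS
    cell_lon = (LON_MAX - LON_MIN) / N_COLS
    # Assumption: rows are counted from the north edge, columns from the west edge.
    # min() keeps points on the south or east edge inside the last cell.
    row = min(int((LAT_MAX - lat) / cell_lat), N_ROWS - 1)
    col = min(int((lon - LON_MIN) / cell_lon), N_COLS - 1)
    return row, col
```

If the actual orientation differs, only the two lines computing row and col need to change.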
Demand: a timetable covering 2014-07-01 00:00:00 to 2016-06-30 23:00:00 at 1-hour intervals
Demand =
time demand
___________________ ________________
2016-01-01 00:00:00 [32×32×2 double]
2016-01-01 01:00:00 [32×32×2 double]
... ...
For example, Demand.demand{1}(:,:,1) is a [32×32 double] matrix corresponding to Demand.time(1): 2014-07-01 00:00:00. It holds the number of passengers picked up in each rectangular cell within Manhattan (as defined by the Manhattan boundary), counted from 2014-07-01 00:00:00 through 2014-07-01 00:59:59. Similarly, Demand.demand{1}(:,:,2) holds the number of passengers dropped off within Manhattan during the same period.
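The hourly counting described here can be illustrated with a small Python sketch. It is not the MapReduce workflow itself: the function names, the event tuple layout, and the sample events are all hypothetical, and each event is assumed to already carry its zero-based grid cell. Channel 0 holds pickups and channel 1 drop-offs, mirroring (:,:,1) and (:,:,2).

```python
from datetime import datetime

N_ROWS, N_COLS = 32, 32
PICKUP, DROPOFF = 0, 1  # zero-based counterparts of (:,:,1) and (:,:,2)


def empty_grid():
    """Return a 32 x 32 x 2 nested list of zeros."""
    return [[[0, 0] for _ in range(N_COLS)] for _ in range(N_ROWS)]


def aggregate_hourly(events):
    """Sum passenger counts per hour, cell, and channel.

    events: iterable of (timestamp, row, col, channel, passenger_count).
    Returns a dict mapping the start of each hour to its grid.
    """
    grids = {}
    for ts, row, col, channel, passengers in events:
        # 00:00:00 through 00:59:59 all fall into the 00:00:00 bin.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        grid = grids.setdefault(hour, empty_grid())
        grid[row][col][channel] += passengers
    return grids


# Hypothetical events, for illustration only.
events = [
    (datetime(2014, 7, 1, 0, 5, 12), 3, 7, PICKUP, 2),
    (datetime(2014, 7, 1, 0, 59, 59), 3, 7, PICKUP, 1),
    (datetime(2014, 7, 1, 1, 0, 0), 3, 7, PICKUP, 4),
    (datetime(2014, 7, 1, 0, 30, 0), 10, 2, DROPOFF, 1),
]
grids = aggregate_hourly(events)
```

Because the sums are keyed by hour and cell, the same logic splits naturally into a map step (emit key and passenger count) and a reduce step (sum per key).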
demand.h5: the same data rearranged and stored in HDF5 format. It contains two datasets: 'demand_tensor' and 'datetime'
demand_tensor: a 32x32x2x17545 4-D tensor of doubles. The last dimension corresponds to time: demand_tensor[:, :, 0, i] is the pickup matrix at time datetime[i], and demand_tensor[:, :, 1, i] is the corresponding drop-off matrix.
datetime: a 17545x1 array of timestamp strings.
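To find the slice for a given hour, one option is to build a lookup from the 'datetime' strings to positions along the time dimension. The Python sketch below makes two assumptions: the strings use the same layout as the timetable display ("2014-07-01 00:00:00"), and the reader may return them as bytes rather than text. The function build_time_index and the sample list are hypothetical.

```python
from datetime import datetime

# Assumption: timestamps are formatted as in the timetable display.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_time_index(timestamps):
    """Map each parsed timestamp to its position in the sequence."""
    index = {}
    for i, raw in enumerate(timestamps):
        # Some HDF5 readers return strings as bytes; decode when needed.
        text = raw.decode("ascii") if isinstance(raw, bytes) else raw
        index[datetime.strptime(text.strip(), TIME_FORMAT)] = i
    return index


# Hypothetical stand-in for the first entries of the 'datetime' dataset.
sample = ["2014-07-01 00:00:00", b"2014-07-01 01:00:00", "2014-07-01 02:00:00"]
time_index = build_time_index(sample)
i = time_index[datetime(2014, 7, 1, 1)]  # position of the 01:00 slice
```

The resulting i is the index to use in the time dimension of demand_tensor.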