- Extract raw data.
- CSV conversion: use Excel or LibreOffice to load, then save as CSV.
- You may end up with Excel's default line endings (^M in vim); convert on Linux/Mac OS using:
tr "\r" "\n"
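A quick sanity check of what `tr` does here: mapping each carriage return to a newline suits CR-only line endings (the classic Mac-style endings Excel can emit); files with CRLF endings would instead want the \r deleted. Sample data below is fabricated for illustration:

```shell
# CR-only line endings: map \r to \n
printf '1,a\r2,b\r' | tr "\r" "\n"

# CRLF line endings would instead need the \r stripped:
printf '1,a\r\n2,b\r\n' | tr -d "\r"
```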
- When building an aggregated dataset, remember to strip headers from the CSV files.
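One hedged way to do the header stripping while aggregating (file names and contents below are made up for illustration; the real per-period exports would go in place of the sample files):

```shell
# Build one aggregated CSV, keeping the header row only once.
mkdir -p inputs/csv outputs
printf 'id,name\n1,a\n' > inputs/csv/01.csv    # sample input files
printf 'id,name\n2,b\n' > inputs/csv/02.csv

head -1 inputs/csv/01.csv > outputs/aggregated_sample.csv   # header from the first file
for f in inputs/csv/*.csv; do
  tail -n +2 "$f" >> outputs/aggregated_sample.csv          # data rows only
done
```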
cat inputs/boris_station_snapshot.xml | ./parse_stations.rb outputs/boris_stations.csv
csvfix order -f 9,10 aggregated_data.csv > src_stations.01.orig.csv
csvfix order -f 14,15 aggregated_data.csv >> src_stations.01.orig.csv
cat src_stations.01.orig.csv | sort | uniq > src_stations.02.uniq.csv
- Remove leading zeros (by hand, with sed, or with Google Refine)
- Change non-numeric terminal IDs to numeric
cat src_stations.02.uniq_edit.csv | sort | uniq > src_stations.03.uniq.clean.csv
csvfix sql_insert -t stations -f 1:logical_terminal,2:full_name src_stations.03.uniq.clean.csv > src_stations.sql
- Remove duplicates caused by changed names (cross reference recent list of stations)
Option 2 yields a number of duplicate names, which are rejected on DB insertion. In each case, there are reasonable replacements in the recent data suggesting that the logical terminal IDs have been retained for modified or renamed stations. The differences are:
- Marylebone Flyover -> Paddington Green Police Station
- Gloucester Slips Car Park -> London Zoo Car Park
- Wapping Lane 2 -> Wapping Lane
- 6 -> Mechanical Workshop PS
Comparing a sample of 10 records between the two suggests the downloaded source can be used reliably as a list of logical terminal IDs. However, it should be augmented with additional records from the source data, such as NA, 0, and Tabletop1.
- Load into the DB (use SQL workbench or similar)
mysql -utfl -f tfl_bike < stations.downloaded.sql
mysql -utfl -f tfl_bike < stations.source.sql
Duplicate rows in the second file will fail as duplicate-key errors; the -f ('force') option tells mysql to continue past them rather than abort.
- Ensure schema is loaded into MySQL DB
- Fix references to non-numeric terminal IDs
csvfix edit -f 9,14 -e 's/^Tabletop1$/-2/' aggregated_data.csv > aggregated_data.01.fix_table.csv
csvfix edit -f 9,14 -e 's/^NA$/-1/' aggregated_data.01.fix_table.csv > aggregated_data.02.fixed.csv
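If csvfix is not to hand, the same per-field substitution can be sketched with plain awk (the sample row is fabricated, and the naive comma splitting assumes no quoted fields; a real CSV parser would be needed otherwise):

```shell
# Map the sentinel terminal IDs used above (NA -> -1, Tabletop1 -> -2)
# in fields 9 and 14 only, leaving all other columns untouched.
printf 'r1,,,,,,,,NA,,,,,Tabletop1\n' \
| awk -F, -v OFS=, '{
    if ($9  == "NA")        $9  = -1
    if ($9  == "Tabletop1") $9  = -2
    if ($14 == "NA")        $14 = -1
    if ($14 == "Tabletop1") $14 = -2
    print
  }'
```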
- Fix up dates for insertion
csvfix edit -f 7,12 -e "s|\(.*\)|STR_TO_DATE('\1', '%e/%c/%Y %T')|" aggregated_data.02.fixed.csv > aggregated_data.03.date_func.csv
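To see what that edit does to a single timestamp field, the sed expression can be run on one sample value (the value here is fabricated):

```shell
# Wrap a d/m/Y H:M:S value in MySQL's STR_TO_DATE() for insertion.
echo "14/3/2012 17:25:00" | sed "s|\(.*\)|STR_TO_DATE('\1', '%e/%c/%Y %T')|"
```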
- Use csvfix to generate SQL statements.
csvfix sql_insert -t journeys -f 1:rental_id,2:billable_duration,3:duration,4:customer_record_number,5:subscription_id,6:bike_id,7:end_timestamp,8:end_station_id,9:end_station_logical_term,11:end_station_priority_id,12:start_timestamp,13:start_station_id,14:start_station_logical_term,16:start_station_priority_id,17:end_hour_category_id,18:start_hour_category_id,19:bike_user_type_id -nq 7,11 aggregated_data.03.date_func.csv > aggregated_data.import.sql
- Load into the DB
mysql -utfl -f tfl_bike < aggregated_data.import.sql > import.out 2>&1 &
- Run through summ_by_X.rb
- Scripts for day versus journey summaries
- Input read from STDIN
- Specify output CSV file as first param
- Day script requires records ordered by timestamp
awk -F"," '{print $4}' inputs/csv/*.csv | sort | uniq -c | sort > outputs/riders/top_customer.ids
- Filter journeys down to manageable set (e.g. top 2k / popular routes; eliminate missing bikes)
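A hedged sketch of that filtering step, reusing an id list like the one built above (the journeys and ids below are fabricated; field 4 is the customer record number per the sql_insert column mapping):

```shell
# Keep only journeys whose customer id (field 4) appears in an id list.
printf '3001\n3007\n' > top_customer.ids
printf 'j1,,,3001,x\nj2,,,9999,x\nj3,,,3007,x\n' \
| awk -F, 'NR==FNR { keep[$1]; next } $4 in keep' top_customer.ids -
```

The first awk file builds a lookup set from the id list; `-` then reads the journey rows from STDIN and prints only matching ones.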
- Create edge table: 'Source', 'Target', 'Weight' columns. Other attributes can be added for partitioning, etc.
- Prepare for import - e.g. store as CSV for import into Gephi.
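One way to sketch the Source/Target/Weight edge table from the aggregated CSV (fields 14 and 9 are the start and end logical terminals per the column mapping above; the sample rows are fabricated):

```shell
# Count journeys per (start, end) station pair and emit a Gephi edge table.
printf 'r1,,,,,,,,22,,,,,105\nr2,,,,,,,,22,,,,,105\nr3,,,,,,,,7,,,,,105\n' \
| awk -F, '{ print $14 "," $9 }' | sort | uniq -c \
| awk 'BEGIN { print "Source,Target,Weight" } { print $2 "," $1 }'
```

Redirecting the output to a .csv file gives something Gephi's spreadsheet importer will accept directly as an edge table.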
- Extract boris station list for Gephi use. Must have an 'id' column.
- Open Gephi, go to Data Lab
- Import spreadsheet - straight edge table
- Import spreadsheet - node list
- Check for added duplicates (sort by name, look for blank lat/long)
- Generate area labels (copy using regex). Step 1 == ',.$' ... Step 2 == '[^,].'
- Apply edge and node ranking ...
- Apply clustering, then use to partition ...
Labels in the gephi_project/complete_viz/bojo_viz3_areas_deletions.gephi visualization have been merged for consistency.
- Move all controls into panels DONE
- Add hover info panel DONE
- Add reset button DONE
- Make #journeys threshold absolute ?
- Try fixed-range scaling for edges (2-20?) DONE
- Try different edge opacity DONE
- Try edge colour range based on weight or time x
- Scale nodes? DONE
- Fix count slider events DONE
- Fix new dataset filtering / reset DONE
- Finalize 1440 vs 1439 design