- Complete code support from preprocessing to results visualization. In particular, the preprocessing code is easy to migrate to other datasets.
- Most of the code is commented with detailed explanations.
- The preprocessing supports multiprocessing to reduce running time.
- Large data such as recurrence plots are generated on disk using PyTables rather than in RAM, which saves a lot of memory (see the sketch after this list).
- Training supports a single GPU, multiple GPUs, multiple nodes, and CPU.
- All the experiments in the paper can be run using the shell scripts we provide.
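As a minimal sketch of the on-disk storage idea mentioned above, recurrence-plot matrices can be appended to an extendable HDF5 array with PyTables so the full set of RPs never has to fit in RAM. The file name, array name, and shapes below are hypothetical, not the project's actual ones.

```python
# Minimal sketch (file and array names are hypothetical, not the project's):
# append recurrence-plot matrices to an extendable HDF5 array on disk with
# PyTables instead of holding them all in memory.
import numpy as np
import tables

n_channels, rp_size = 5, 200                       # illustrative shapes only
h5 = tables.open_file('RP_demo.h5', mode='w')
rp_store = h5.create_earray(
    h5.root, 'RP_mats',
    atom=tables.Float32Atom(),
    shape=(0, n_channels, rp_size, rp_size),       # first dimension grows with append()
)

for _ in range(10):                                # write batch by batch
    batch = np.random.rand(2, n_channels, rp_size, rp_size).astype(np.float32)
    rp_store.append(batch)

h5.close()
```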
- Install PyTorch (pytorch.org)
pip install -r requirements.txt
- Download the Geolife dataset at https://www.microsoft.com/en-us/research/project/geolife-buildingsocial-networks-using-human-location-history/#!downloads, then put all user folders under
Geolife Trajectories 1.3-Raw-All/Geolife Trajectories 1.3/Data
into ./data/geolife_raw in our project folder.
- Download the SHL dataset at http://www.shl-dataset.org/download/, then put all folders under
SHLDataset_User1Hips_v1/release/User1
into ./data/SHL_raw in our project folder.
- Trajectories and labels extracted from the raw datasets, together with the generated handcrafted features and recurrence plots (RPs), are stored under `./data`.
- Our main Dual-CSA network is under `network_torch/`.
- `network_variant/` includes the variants of our Dual-CSA.
- `ML_comparison/` includes the classical machine learning methods we used for comparison.
- `network_comparison/` includes other competitive deep-learning-based methods; note that they are implemented using Keras.
- `exp_scripts/` includes all shell scripts we used to conduct the experiments.
- `keras_support_old/` contains the Keras implementation of our model; note that we no longer maintain the Keras version.
- `results/` is where all training and prediction results are saved.
- `visualization_and_analysis/` includes Python code to draw the charts of experimental results shown in the paper.
- Before running the code, please set the results-path environment variable (RES_PATH) in the terminal, for example:
$ results_path=./results/exp1
$ export RES_PATH=${results_path}
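A hedged assumption of how the scripts use this variable (the exact resolution logic lives in the project code):

```python
# Hedged assumption: the training/prediction scripts resolve the results
# directory from the RES_PATH environment variable exported above.
import os

results_path = os.environ.get('RES_PATH', './results/default')   # fallback is illustrative
os.makedirs(results_path, exist_ok=True)
```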
- Please first run `trajectory_extraction_geolife.py` and `trajectory_extraction_SHL.py` to generate the train & test trajectory and label .npy files under `./data/*_extracted`.
- `trajectory_segmentation_and_features_extraction.py` is used to segment the trajectories and extract movement features (MFs) and auxiliary features (AFs) for the train or test set. For example:
python ./trajectory_segmentation_and_features_extraction.py --trjs_path ./data/geolife_extracted/trjs_train.npy --labels_path ./data/geolife_extracted/labels_train.npy --seg_size 200 --data_type train --save_dir ./data/geolife_features
where `seg_size` is the maximum number of points in a segment.
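Purely as an illustration of what the `seg_size` cap means (the real script also extracts MFs and AFs, so this is not its actual logic), a trajectory is cut into consecutive chunks of at most `seg_size` points:

```python
# Hypothetical illustration only: cut an (N, k) trajectory array into consecutive
# segments of at most seg_size points.
import numpy as np

def segment_trajectory(points, seg_size=200):
    return [points[i:i + seg_size] for i in range(0, len(points), seg_size)]

trj = np.random.rand(450, 3)                # e.g. one (lat, lon, timestamp) row per point
segments = segment_trajectory(trj, seg_size=200)
print([len(s) for s in segments])           # -> [200, 200, 50]
```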
- `MF_RP_mat_h5support.py` is used to generate RPs for the feature segments produced above. For example:
python ./MF_RP_mat_h5support.py --dim 3 --tau 8 --multi_feature_segs_path ./data/geolife_features/multi_feature_segs_train.npy --save_path ./data/geolife_features/multi_channel_RP_mats_train.h5
where `dim` is the embedding dimension and `tau` is the time delay used in phase-space reconstruction.
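To make `dim` and `tau` concrete, here is a hedged, simplified sketch of an un-thresholded recurrence plot computed from a single 1-D feature series via time-delay embedding; `MF_RP_mat_h5support.py` produces multi-channel RPs, so the function below is illustrative only:

```python
import numpy as np

def recurrence_plot(series, dim=3, tau=8):
    """Embed a 1-D series with time delay tau and embedding dimension dim,
    then return the pairwise Euclidean distances (an un-thresholded RP)."""
    series = np.asarray(series, dtype=float)
    n_vectors = len(series) - (dim - 1) * tau
    # phase-space reconstruction: row t is (x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau})
    emb = np.stack([series[i:i + n_vectors] for i in range(0, dim * tau, tau)], axis=1)
    return np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)

rp = recurrence_plot(np.sin(np.linspace(0, 20, 200)), dim=3, tau=8)
print(rp.shape)   # -> (184, 184) for a 200-point series with dim=3, tau=8
```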
- `PEDCC.py` is used to generate the predefined evenly-distributed class centroids, using the code of the paper "A Classification Supervised Auto-Encoder Based on Predefined Evenly-Distributed Class Centroids". For example:
python ./PEDCC.py --save_dir ./data/geolife_features --dim 304
where `dim` is the embedding dimension of the latent space.
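The centroids themselves come from the referenced paper's code. Purely for intuition about what "evenly-distributed class centroids" means, the toy sketch below spreads class centroids on the unit hypersphere by minimizing a frame-potential-style objective; it is not the PEDCC procedure that `PEDCC.py` uses, and the class count and iteration settings are illustrative.

```python
# Toy illustration only (NOT the PEDCC algorithm used by PEDCC.py): spread
# n_classes unit vectors apart via gradient descent on sum_{i!=j} (u_i . u_j)^2.
import numpy as np

def spread_centroids(n_classes, dim, n_iter=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_classes, dim))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    for _ in range(n_iter):
        sim = u @ u.T
        np.fill_diagonal(sim, 0.0)                        # ignore self-similarity
        u -= lr * (sim @ u)                               # push away from similar centroids
        u /= np.linalg.norm(u, axis=1, keepdims=True)     # project back onto the unit sphere
    return u

centroids = spread_centroids(n_classes=5, dim=304)        # hypothetical class count
print(np.round(centroids @ centroids.T, 3))               # off-diagonal entries near 0
```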
- `network_training.py` is the training code of our model. The network can be trained on a single GPU, multiple GPUs, multiple nodes, or the CPU. Its detailed usage is:
usage: network_training.py [-h] --dataset DATASET [--network NETWORK]
[--results-path RESULTS_PATH]
[--RP-emb-dim RP_EMB_DIM] [--FS-emb-dim FS_EMB_DIM]
[--patience PATIENCE]
[--training-strategy TRAINING_STRATEGY]
[--no-save-model] [--visualize-emb VISUALIZE_EMB]
[--n-features N_FEATURES] [-j N]
[--pretrain-epochs N] [--joint-train-epochs N]
[-b N] [--wd W] [-p N] [--resume PATH] [-e]
[--pretrained] [--world-size WORLD_SIZE]
[--rank RANK] [--dist-url DIST_URL]
[--dist-backend DIST_BACKEND] [--seed SEED]
[--gpu GPU] [--multiprocessing-distributed]
DCSA_Training
optional arguments:
-h, --help show this help message and exit
--dataset DATASET geolife or SHL
--network NETWORK default is Dual_CSA, can be the variants: CSA-RP, CSA-
FS, and Dual-CA-Softmax
--results-path RESULTS_PATH
path to save the training and predict results
--RP-emb-dim RP_EMB_DIM
embedding dimension of RP autoencoder in latent space
--FS-emb-dim FS_EMB_DIM
embedding dimension of FS autoencoder in latent space
--patience PATIENCE patience of early stop in joint training
--training-strategy TRAINING_STRATEGY
can be: normal_training, normal_only_pretraining,
no_pre_joint_training, and only_joint_training
--no-save-model this flag will not save model
--visualize-emb VISUALIZE_EMB
this flag will turn on save latent visualization
images every visualize_emb epochs
--n-features N_FEATURES
number of MFs and AFs used, default 5
-j N, --workers N number of data loading workers (default: 4)
--pretrain-epochs N number of pretraining epochs to run
--joint-train-epochs N
number of joint training epochs to run
-b N, --batch-size N mini-batch size (default: 256), this is the total
batch size of all GPUs on the current node when using
Data Parallel or Distributed Data Parallel
--wd W, --weight-decay W
weight decay (default: 1e-4)
-p N, --print-freq N print frequency (default: 4)
--resume PATH path to latest checkpoint (default: none)
-e, --evaluate evaluate model on the test set
--pretrained use pre-trained model
--world-size WORLD_SIZE
number of nodes for distributed training
--rank RANK node rank for distributed training
--dist-url DIST_URL url used to set up distributed training
--dist-backend DIST_BACKEND
distributed backend
--seed SEED seed for initializing training.
--gpu GPU GPU id to use.
--multiprocessing-distributed
Use multi-processing distributed training to launch N
processes per node, which has N GPUs. This is the
fastest way to use PyTorch for either single node or
multi node data parallel training
Example usage for a single node with one or multiple GPUs:
python network_training.py --dataset SHL --results-path ./results/exp1 --RP-emb-dim 152 --FS-emb-dim 152 --patience 20 --dist-url tcp://127.0.0.1:6666 --dist-backend nccl --multiprocessing-distributed --world-size 1 --rank 0 -b 230
Example usage for multiple nodes:
- Node 0:
python network_training.py --dataset SHL --results-path ./results/exp1 --RP-emb-dim 152 --FS-emb-dim 152 --patience 20 --dist-url tcp://127.0.0.1:6666 --dist-backend nccl --multiprocessing-distributed --world-size 2 --rank 0 -b 230
- Node 1 (note --rank 1; --dist-url should point to an address of node 0 that node 1 can reach):
python network_training.py --dataset SHL --results-path ./results/exp1 --RP-emb-dim 152 --FS-emb-dim 152 --patience 20 --dist-url tcp://127.0.0.1:6666 --dist-backend nccl --multiprocessing-distributed --world-size 2 --rank 1 -b 230
Example usage for CPU (Slow):
python network_training.py --dataset SHL --results-path ./results/exp1 --RP-emb-dim 152 --FS-emb-dim 152 --patience 20 -b 230
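For readers unfamiliar with the distributed flags above, here is a hedged sketch of the standard PyTorch pattern that --multiprocessing-distributed, --world-size, --rank, and --dist-url correspond to (one worker process per GPU, joined into a single process group). It is not the project's actual training code; the model is a placeholder.

```python
# Hedged sketch (not the project's code) of the standard PyTorch pattern behind
# --multiprocessing-distributed: one worker per GPU, all joined into a process
# group of size world_size = n_nodes * n_gpus_per_node.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(gpu, ngpus_per_node, world_size, node_rank, dist_url):
    rank = node_rank * ngpus_per_node + gpu          # global rank of this process
    dist.init_process_group(backend='nccl', init_method=dist_url,
                            world_size=world_size, rank=rank)
    torch.cuda.set_device(gpu)
    model = torch.nn.Linear(8, 2).cuda(gpu)          # placeholder for the Dual_CSA network
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # ... training loop with a DistributedSampler would go here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    ngpus_per_node = torch.cuda.device_count()
    world_size = 1 * ngpus_per_node                  # 1 node in this example
    mp.spawn(worker, nprocs=ngpus_per_node,
             args=(ngpus_per_node, world_size, 0, 'tcp://127.0.0.1:6666'))
```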
Below is an example shell script, to be run under `exp_scripts/`:
cd ..
dataset='SHL'
results_path=./results/exp
export RES_PATH=${results_path}
python ./trajectory_segmentation_and_features_extraction.py --trjs_path ./data/SHL_extracted/trjs_train.npy --labels_path ./data/SHL_extracted/labels_train.npy --seg_size 200 --data_type train --save_dir ./data/SHL_features
python ./trajectory_segmentation_and_features_extraction.py --trjs_path ./data/SHL_extracted/trjs_test.npy --labels_path ./data/SHL_extracted/labels_test.npy --seg_size 200 --data_type test --save_dir ./data/SHL_features
python ./MF_RP_mat_h5support.py --dim 3 --tau 8 --multi_feature_segs_path ./data/SHL_features/multi_feature_segs_train.npy --save_path ./data/SHL_features/multi_channel_RP_mats_train.h5
python ./MF_RP_mat_h5support.py --dim 3 --tau 8 --multi_feature_segs_path ./data/SHL_features/multi_feature_segs_test.npy --save_path ./data/SHL_features/multi_channel_RP_mats_test.h5
python ./PEDCC.py --save_dir ./data/SHL_features --dim 304
python network_training.py --dataset ${dataset} --results-path ${results_path} --RP-emb-dim 152 --FS-emb-dim 152 --patience 20 --dist-url tcp://127.0.0.1:6666 --dist-backend nccl --multiprocessing-distributed --world-size 1 --rank 0 -b 230