BATF

MATLAB code for "Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model".


Bayesian augmented tensor factorization

This repository provides the MATLAB code for the following paper:

  • Xinyu Chen, Zhaocheng He, Yixian Chen, Yuhuan Lu, Jiawei Wang (2019). Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model. Transportation Research Part C: Emerging Technologies, 104: 66-77. [preprint] [doi] [slide] [data] [Matlab code] [Python code]

Project structure

Code of BATF

  • BATF_VB.m: The main code file. Implements variational inference for BATF.
  • Test_BATF.m: The testing file; it contains a testing example.

Utility functions

  • ms_scenario.m: Generates an incomplete tensor under different missing scenarios.
  • cp_combination.m: A function for the CP combination over factor matrices.
  • khatrirao_fast.m: A fast Khatri-Rao product function.
  • kr.m: A Kronecker product function.
  • mat2ten.m: A tensor folding function.
  • ten2mat.m: A tensor unfolding function.
  • vec_combination.m: A function that sums the bias vectors into a third-order tensor. (Please consider using a newer version of MATLAB.)
  • safelog.m: A safe logarithm function inspired by the work of BCPF.

Experimental dataset

  • tensor.mat: The original speed data, organized as a third-order tensor (road segment × day × time interval) of size 214 × 61 × 144. About 1.29% of the values in the raw data set are missing.
  • random_tensor.mat: A synthetic random tensor, used for generating tensors with randomly missing entries.
  • random_matrix.mat: A synthetic random matrix, used for generating tensors with missing fibers.

Usage

BATF can be run as follows:

[model] = BATF_VB(dense_tensor,sparse_tensor,'CP_rank',low_rank,'maxiter',max_iteration);

Required Arguments:

  • dense_tensor: The original data tensor.
  • sparse_tensor: The incomplete tensor generated by the ms_scenario.m function.

Optional Arguments:

  • 'CP_rank': The low rank of the CP factorization.
  • 'maxiter': The maximum number of iterations for variational inference.

Python code is now available!

Please check out the Python code of BATF at transdim for details. The Python implementation uses Gibbs sampling to solve the fully Bayesian augmented tensor factorization; it is not difficult to reproduce as long as numpy is available in your Python environment. There are a few steps to follow:

Prepare the basic functions

  • Import some necessary packages:
import numpy as np
from numpy.random import multivariate_normal as mvnrnd
from scipy.stats import wishart
from numpy.random import normal as normrnd
from scipy.linalg import khatri_rao as kr_prod
from numpy.linalg import inv as inv
from numpy.linalg import solve as solve
from numpy.linalg import cholesky as cholesky_lower
from scipy.linalg import cholesky as cholesky_upper
from scipy.linalg import solve_triangular as solve_ut
  • Define the function for generating multivariate Gaussian distributed random numbers (very fast here!):
def mvnrnd_pre(mu, Lambda):
    src = normrnd(size = (mu.shape[0],))
    return solve_ut(cholesky_upper(Lambda, overwrite_a = True, check_finite = False), 
                    src, lower = False, check_finite = False, overwrite_b = True) + mu
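  • Note that mvnrnd_pre samples from N(mu, inv(Lambda)), i.e., the Gaussian is parameterized by its precision matrix: with the upper Cholesky factor U satisfying Lambda = U^T U, the draw x = inv(U) z + mu (with z standard normal) has covariance inv(U^T U) = inv(Lambda), so no covariance matrix is ever formed or inverted. An illustrative sanity check with made-up values (the copy matters, since overwrite_a = True may destroy the input):
# Illustrative check: the empirical covariance of many draws approaches inv(Lambda).
Lambda = np.array([[2.0, 0.5], [0.5, 1.0]])
samples = np.stack([mvnrnd_pre(np.zeros(2), Lambda.copy()) for _ in range(50000)])
print(np.cov(samples.T))  # should be close to inv(Lambda)
print(inv(Lambda))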
  • Define CP combination for factor matrices and vector combination for biases:
def cp_combine(var):
    return np.einsum('is, js, ts -> ijt', var[0], var[1], var[2])
    
def vec_combine(vector):
    return (vector[0][:, None, None] + vector[1][None, :, None] + vector[2][None, None, :])
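  • An illustrative shape check with made-up sizes (rank-4 factors for a 5 × 6 × 7 tensor):
factors = [np.random.randn(5, 4), np.random.randn(6, 4), np.random.randn(7, 4)]
print(cp_combine(factors).shape)   # (5, 6, 7)
biases = [np.random.randn(5), np.random.randn(6), np.random.randn(7)]
print(vec_combine(biases).shape)   # (5, 6, 7), the three bias vectors broadcast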
  • Define the tensor unfolding function:
def ten2mat(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order = 'F')
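  • An illustrative example with made-up sizes; the mode-k unfolding keeps dimension k as the rows:
X = np.random.randn(3, 4, 5)
print(ten2mat(X, 0).shape)   # (3, 20)
print(ten2mat(X, 1).shape)   # (4, 15)
print(ten2mat(X, 2).shape)   # (5, 12)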
  • Define other necessary functions here:
def cov_mat(mat, mat_bar):
    mat = mat - mat_bar
    return mat.T @ mat

def compute_mape(var, var_hat):
    return np.sum(np.abs(var - var_hat) / var) / var.shape[0]

def compute_rmse(var, var_hat):
    return  np.sqrt(np.sum((var - var_hat) ** 2) / var.shape[0])
  • Define the function for sampling global parameter:
def sample_global_mu(mu_sparse, pos_obs, tau_eps, tau0 = 1):
    tau_tilde = 1 / (tau_eps * len(pos_obs[0]) + tau0)
    mu_tilde = tau_eps * np.sum(mu_sparse) * tau_tilde
    return np.random.normal(mu_tilde, np.sqrt(tau_tilde))
  • Define the function for sampling bias vectors:
def sample_bias_vector(bias_sparse, vector, bias, ind, dim, tau_eps, tau0 = 1):
    for k in range(len(dim)):
        idx = tuple(filter(lambda x: x != k, range(len(dim))))
        temp = vector.copy()
        temp[k] = np.zeros((dim[k]))
        tau_tilde = 1 / (tau_eps * bias[k] + tau0)
        mu_tilde = tau_eps * np.sum(ind * (bias_sparse - vec_combine(temp)), axis = idx) * tau_tilde
        vector[k] = np.random.normal(mu_tilde, np.sqrt(tau_tilde))
    return vector
  • Define the function for sampling factor matrices:
def sample_factor(tau_sparse, factor, ind, dim, k, tau_eps, beta0 = 1):
    dim, rank = factor[k].shape
    factor_bar = np.mean(factor[k], axis = 0)
    temp = dim / (dim + beta0)
    var_mu_hyper = temp * factor_bar
    var_W_hyper = inv(np.eye(rank) + cov_mat(factor[k], factor_bar) + temp * beta0 * np.outer(factor_bar, factor_bar))
    var_Lambda_hyper = wishart.rvs(df = dim + rank, scale = var_W_hyper)
    var_mu_hyper = mvnrnd_pre(var_mu_hyper, (dim + beta0) * var_Lambda_hyper)
    
    idx = list(filter(lambda x: x != k, range(len(factor))))
    var1 = kr_prod(factor[idx[1]], factor[idx[0]]).T
    var2 = kr_prod(var1, var1)
    var3 = (var2 @ ten2mat(tau_eps * ind, k).T).reshape([rank, rank, dim]) + var_Lambda_hyper[:, :, np.newaxis]
    var4 = var1 @ ten2mat(tau_sparse, k).T + (var_Lambda_hyper @ var_mu_hyper)[:, np.newaxis]
    for i in range(dim):
        factor[k][i, :] = mvnrnd_pre(solve(var3[:, :, i], var4[:, i]), var3[:, :, i])
    return factor[k]
  • Define the function for sampling precision parameter:
def sample_precision_tau(error_tensor, pos_obs):
    var_alpha = 1e-6 + 0.5 * len(pos_obs[0])
    var_beta = 1e-6 + 0.5 * np.linalg.norm(error_tensor, 2) ** 2
    return np.random.gamma(var_alpha, 1 / var_beta)
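  • This is the standard conjugate update for the noise precision: under a Gamma(1e-6, 1e-6) prior, the posterior shape grows by half the number of observed entries and the posterior rate by half the squared residual norm; since np.random.gamma expects a scale parameter, the rate var_beta is passed as 1 / var_beta.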
  • Define the function for BATF by using Gibbs sampling:
def BATF_Gibbs(dense_tensor, sparse_tensor, vector, factor, burn_iter, gibbs_iter):
    """Bayesian Augmented Tensor Factorization (BATF) with Gibbs sampling."""

    dim = np.array(sparse_tensor.shape)
    rank = factor[0].shape[1]
    if not np.isnan(sparse_tensor).any():
        ind = sparse_tensor != 0
        pos_obs = np.where(ind)
        pos_test = np.where((dense_tensor != 0) & (sparse_tensor == 0))
    else:
        pos_test = np.where((dense_tensor != 0) & (np.isnan(sparse_tensor)))
        ind = ~np.isnan(sparse_tensor)
        pos_obs = np.where(ind)
        sparse_tensor[np.isnan(sparse_tensor)] = 0
    num_obs = len(pos_obs[0])
    dense_test = dense_tensor[pos_test]
    del dense_tensor

    show_iter = 200
    tau_eps = 1
    bias = []
    for k in range(len(dim)):
        idx = tuple(filter(lambda x: x != k, range(len(dim))))
        bias.append(np.sum(ind, axis = idx))
    temp = cp_combine(factor)
    temp_hat = np.zeros(len(pos_test[0]))
    tensor_hat_plus = np.zeros(dim)
    for it in range(burn_iter + gibbs_iter):
        temp = sparse_tensor - temp
        mu_glb = sample_global_mu(temp[pos_obs] - vec_combine(vector)[pos_obs], pos_obs, tau_eps)
        vector = sample_bias_vector(temp - mu_glb, vector, bias, ind, dim, tau_eps)
        del temp
        tau_sparse = tau_eps * ind * (sparse_tensor - mu_glb - vec_combine(vector))
        for k in range(len(dim)):
            factor[k] = sample_factor(tau_sparse, factor, ind, dim, k, tau_eps)
        temp = cp_combine(factor)
        tensor_hat = mu_glb + vec_combine(vector) + temp
        temp_hat += tensor_hat[pos_test]
        tau_eps = sample_precision_tau(sparse_tensor[pos_obs] - tensor_hat[pos_obs], pos_obs)
        if it + 1 > burn_iter:
            tensor_hat_plus += tensor_hat
        if (it + 1) % show_iter == 0 and it < burn_iter:
            temp_hat = temp_hat / show_iter
            print('Iter: {}'.format(it + 1))
            print('MAPE: {:.6}'.format(compute_mape(dense_test, temp_hat)))
            print('RMSE: {:.6}'.format(compute_rmse(dense_test, temp_hat)))
            temp_hat = np.zeros(len(pos_test[0]))
            print()
    tensor_hat = tensor_hat_plus / gibbs_iter
    print('Imputation MAPE: {:.6}'.format(compute_mape(dense_test, tensor_hat[pos_test])))
    print('Imputation RMSE: {:.6}'.format(compute_rmse(dense_test, tensor_hat[pos_test])))
    print()

    return tensor_hat, mu_glb, vector, factor

Start your evaluation

  • Import traffic data:
import scipy.io

dense_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/tensor.mat')['tensor']
random_tensor = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_tensor.mat')['random_tensor']
missing_rate = 0.4

## Random missing (RM)
binary_tensor = np.round(random_tensor + 0.5 - missing_rate)
sparse_tensor = np.multiply(dense_tensor, binary_tensor)
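  • Alternatively, generate the fiber missing scenario from random_matrix.mat. A minimal sketch, assuming the variable inside the .mat file is named random_matrix, the same thresholding convention as the RM case above, and that each (road segment, day) pair is kept or dropped as a whole fiber along the time-interval dimension:
random_matrix = scipy.io.loadmat('../datasets/Guangzhou-data-set/random_matrix.mat')['random_matrix']

## Fiber missing (a sketch)
binary_tensor = np.zeros(dense_tensor.shape)
for i1 in range(dense_tensor.shape[0]):
    for i2 in range(dense_tensor.shape[1]):
        binary_tensor[i1, i2, :] = np.round(random_matrix[i1, i2] + 0.5 - missing_rate)
sparse_tensor = np.multiply(dense_tensor, binary_tensor)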
  • Perform missing data imputation with BATF:
import time
start = time.time()
dim = np.array(sparse_tensor.shape)
rank = 80
vector = []
factor = []
for k in range(len(dim)):
    vector.append(0.1 * np.random.randn(dim[k],))
    factor.append(0.1 * np.random.randn(dim[k], rank))
burn_iter = 1000
gibbs_iter = 200
tensor_hat, mu_glb, vector, factor = BATF_Gibbs(dense_tensor, sparse_tensor, vector, factor, burn_iter, gibbs_iter)
end = time.time()
print('Running time: %d seconds'%(end - start))
  • Check out the expected results:
Iter: 200
MAPE: 0.08303
RMSE: 3.58407

Iter: 400
MAPE: 0.083102
RMSE: 3.59169

Iter: 600
MAPE: 0.0830566
RMSE: 3.59025

Iter: 800
MAPE: 0.0829992
RMSE: 3.58897

Iter: 1000
MAPE: 0.0829348
RMSE: 3.58643

Imputation MAPE: 0.0828811
Imputation RMSE: 3.5845

Running time: 3592 seconds

Citation

If you find this project useful for your research, please consider citing our publicly available dataset and paper:

  • Urban Traffic Speed Dataset of Guangzhou, China (Available at https://doi.org/10.5281/zenodo.1205229)

    @Misc{XinyuChen2018Urban,
      author    = {{Xinyu Chen} and {Yixian Chen} and {Zhaocheng He}},
      title     = {Urban Traffic Speed Dataset of Guangzhou, China},
      year      = {2018},
      doi       = {10.5281/zenodo.1205228},
      keywords  = {intelligent transportation systems, urban traffic speed data, urban traffic data analytics, missing data imputation, short-term traffic prediction, traffic pattern discovery},
      publisher = {Zenodo},
    }
  • Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model (Available at https://doi.org/10.1016/j.trc.2019.03.003)

    @Article{Chen2019Missing,
      author    = {Xinyu Chen and Zhaocheng He and Yixian Chen and Yuhuan Lu and Jiawei Wang},
      title     = {Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model},
      journal   = {Transportation Research Part C: Emerging Technologies},
      year      = {2019},
      volume    = {104},
      pages     = {66--77},
      month     = {jul},
      doi       = {10.1016/j.trc.2019.03.003},
      publisher = {Elsevier {BV}},
    }

License

The MIT License (MIT)