erichson/ristretto

Sparse PCA and reconstruction error

Closed this issue · 8 comments

I test the accuracy of ristretto.pca.compute_spca by setting:

  • alpha = 0. No L1 regularization.
  • n_components = number of features. No dimension reduction.
  • tol = 1e-32, max_iter = 1e6. A very tight tolerance and a large number of iterations.

I assume the result should be the same as a standard PCA over matrix X.

The definition of relative error is: Error[i, j] = 100 * Abs(1 - Xreconstructed[i, j] / Xactual[i, j]).
X contains no zeros or very small numbers.

However, the reconstruction error can be large for some elements. My guess is that as the covariance matrix becomes sparser, the spikes in the relative error become taller, but I am not sure.

[Attached figures: covariance matrix, histogram of relative error, relative error heatmap]

Test code
import ristretto.pca
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def make_covariance_matrix(n_feat, zero_prob=0.99):
    shape = (n_feat, n_feat)
    cov_mat = np.random.uniform(0, 1, shape)
    cov_mat *= np.random.choice(2, shape, p=[zero_prob, 1 - zero_prob])
    np.fill_diagonal(cov_mat, 1)
    cov_mat = cov_mat.dot(cov_mat.T)
    return cov_mat


def make_independent_normal_dists(n_feat, n_obs):
    return np.random.normal(0, 1, (n_feat, n_obs))


def make_correlated_normal_dists(n_obs, n_feat, cov_mat=None):
    """
    https://scipy-cookbook.readthedocs.io/items/CorrelatedRandomSamples.html
    """
    if cov_mat is None:
        cov_mat = make_covariance_matrix(n_feat)
    c = cholesky(cov_mat, lower=True)
    ind_norm = make_independent_normal_dists(n_feat, n_obs)
    return np.dot(c, ind_norm).T, cov_mat


def plot_covariance_mat(cov_mat):
    plt.title('covariance matrix')
    plt.imshow(cov_mat)
    plt.colorbar()
    plt.xlabel('feature index')
    plt.ylabel('feature index')
    plt.show()
    plt.close()


def plot_filtered_error(rel_error, percentile_thres=99.5):
    error_1d = np.ravel(rel_error)
    threshold = np.percentile(rel_error, percentile_thres)
    error_filtered = error_1d[error_1d < threshold]
    plt.title('Histogram of Relative Error below the %.1fth percentile' %
              percentile_thres)
    plt.hist(np.ravel(error_filtered), bins=100)
    plt.xlabel('Relative Error')
    plt.ylabel('Count')
    plt.show()
    plt.close()


def main():
    n_obs = 5000  # Number of observations
    n_feat = 50  # Number of features
    n_components = 50  # Number of principal components.

    # Parameters for standard PCA (not sparse).
    sparsity_alpha = 0
    robust = False
    ridge_beta = 0.1

    # Create test data with zero mean along each column
    X, cov_mat = make_correlated_normal_dists(n_obs // 2, n_feat)
    X -= np.mean(X, axis=0)  # Centering each column by its mean.
    X /= np.std(X, axis=0)
    X = np.vstack([X - 4, X + 4])  # Makes X bimodal, with modes at -4 and +4.
    # np.random.shuffle(X)  # shuffle rows of X
    plot_covariance_mat(cov_mat)

    # Sparse PCA
    B, A, eigen_values, obj = ristretto.pca.compute_spca(
        X, n_components=n_components, alpha=sparsity_alpha, robust=robust,
        beta=ridge_beta, tol=1e-32, max_iter=1e6)

    # Reconstruction of X by (XB)A^{T}
    recons_x = np.matmul(np.matmul(X, B), np.transpose(A))

    # Calculate relative error
    delta = X - recons_x
    rel_error = np.abs(delta / X) * 100

    # Plot histogram of relative error
    # filter relative error because some are crazy.
    plot_filtered_error(rel_error)

    # Plot relative error
    plt.title('Relative Reconstruction Error (capped at 200%)')
    plt.imshow(rel_error, vmin=0, vmax=200, aspect='auto')
    plt.colorbar()
    plt.xlabel('feature index')
    plt.ylabel('observation index')
    plt.show()
    plt.close()


main()

Sorry, the formula for relative error is a division: when the denominator is close to zero, the error becomes large. I picked a distribution for X that excludes zero.
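A tiny illustration with made-up numbers (not from the test above): a small absolute error blows up the relative error whenever the true value sits near zero.

import numpy as np

# Each reconstructed value is off by 0.05 in absolute terms,
# but the middle entry of x_true is close to zero.
x_true = np.array([5.0, 0.01, -3.0])
x_recon = np.array([5.05, 0.06, -2.97])
rel_error = 100 * np.abs(1 - x_recon / x_true)
print(rel_error)  # roughly [1, 500, 1] percent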

@hamster-on-wheels just to double check, is your problem solved, or would you like me to have a closer look into it?

Best,
Ben

Thanks for making this sparse PCA. It is much faster than the mini-batch sparse PCA in scikit-learn, which is too slow for me. I have some questions about the method with regard to my application.

Why do I still get >200% spikes in the relative error when:

  1. the number of components equals the number of features, and
  2. there is no L1 regularization?

I thought the result should be the same as standard PCA.

There should be no reconstruction error, because with a full set of components the basis can reconstruct every observation exactly.
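A quick sanity check of that expectation, using numpy's SVD as the reference full-rank basis (a sketch; the variable names are mine, not ristretto's):

import numpy as np

# With all n_feat right singular vectors, Vt is square and orthogonal,
# so X @ Vt.T @ Vt reproduces X up to floating-point error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_recon = X @ Vt.T @ Vt
print(np.max(np.abs(X - X_recon)))  # essentially zero (around 1e-13)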

In my application, the covariance matrix is sparse. Will this sparsity cause problems for the sparse PCA method?

Thank you.

The beta was too big. The errors disappear in the n_comp = n_feat case when beta is left at its default.

When n_comp = 5, n_feat = 50, I get some 200% relative error again... Is this normal?

Higher sparsity in the covariance matrix (larger zero_prob) seems to lead to taller spikes in the relative error.

@hamster-on-wheels 200% relative error does not sound normal. Can you vary the beta parameter and make it smaller to see if the relative error gets better, while still maintaining a desired amount of sparsity?
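Something like the sweep below would do. It is only a sketch that reuses the X built by your script; the beta values are just examples, and the other parameters mirror your second test script.

import numpy as np
import ristretto.pca

def sweep_beta(X, betas=(1e-16, 1e-8, 1e-4, 1e-2, 1e-1)):
    # Report the worst-case and 99.5th-percentile relative error for each beta.
    for ridge_beta in betas:
        B, A, eigen_values, obj = ristretto.pca.compute_spca(
            X, n_components=5, alpha=0, robust=False,
            beta=ridge_beta, tol=1e-5, max_iter=1e4)
        recons_x = X @ B @ A.T
        rel_error = 100 * np.abs((X - recons_x) / X)
        print(ridge_beta, np.max(rel_error), np.percentile(rel_error, 99.5))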

I will probably have to try this another day. The spikes in the reconstruction error don't disappear after several manual tries of beta.

import ristretto.pca
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky


def make_covariance_matrix(n_feat, zero_prob=0.5):
    shape = (n_feat, n_feat)
    cov_mat = np.random.uniform(0, 1, shape)
    cov_mat *= np.random.choice(2, shape, p=[zero_prob, 1 - zero_prob])
    np.fill_diagonal(cov_mat, 1)
    cov_mat = cov_mat.dot(cov_mat.T)
    return cov_mat


def make_independent_normal_dists(n_feat, n_obs):
    return np.random.normal(0, 1, (n_feat, n_obs))


def make_correlated_normal_dists(n_obs, n_feat, cov_mat=None, zero_prob=0.5):
    """
    https://scipy-cookbook.readthedocs.io/items/CorrelatedRandomSamples.html
    """
    if cov_mat is None:
        cov_mat = make_covariance_matrix(n_feat, zero_prob=zero_prob)
    c = cholesky(cov_mat, lower=True)
    ind_norm = make_independent_normal_dists(n_feat, n_obs)
    return np.dot(c, ind_norm).T, cov_mat


def plot_covariance_mat(cov_mat):
    plt.title('covariance matrix')
    plt.imshow(cov_mat)
    plt.colorbar()
    plt.xlabel('feature index')
    plt.ylabel('feature index')
    plt.show()
    plt.close()


def plot_filtered_error(rel_error, percentile_thres=99.5):
    error_1d = np.ravel(rel_error)
    threshold = np.percentile(rel_error, percentile_thres)
    error_filtered = error_1d[error_1d < threshold]
    plt.title('Histogram of Relative Error below the %.1fth percentile' %
              percentile_thres)
    plt.hist(np.ravel(error_filtered), bins=100)
    plt.xlabel('Relative Error')
    plt.ylabel('Count')
    plt.show()
    plt.close()


def main():
    n_obs = 5000  # Number of observations
    n_feat = 50  # Number of features
    n_components = 5  # Number of principal components.

    # Parameters for standard PCA (not sparse).
    sparsity_alpha = 0
    robust = False
    ridge_beta = 1e-16
    zero_prob = 0.95

    # Create test data with zero mean along each column
    X, cov_mat = make_correlated_normal_dists(n_obs // 2, n_feat,
                                              zero_prob=zero_prob)
    X -= np.mean(X, axis=0)  # Centering each column by its mean.
    X /= np.std(X, axis=0)
    X = np.vstack([X - 4, X + 4])  # Makes X bimodal, with modes at -4 and +4.
    # np.random.shuffle(X)  # shuffle rows of X
    plot_covariance_mat(cov_mat)

    # Sparse PCA
    B, A, eigen_values, obj = ristretto.pca.compute_spca(
        X, n_components=n_components, alpha=sparsity_alpha, robust=robust,
        beta=ridge_beta, tol=1e-5, max_iter=1e4)

    # Reconstruction of X by (XB)A^{T}
    recons_x = np.matmul(np.matmul(X, B), np.transpose(A))

    # Calculate relative error
    delta = X - recons_x
    rel_error = np.abs(delta / X) * 100

    # Plot histogram of relative error
    # filter relative error because some are crazy.
    plot_filtered_error(rel_error)

    # Plot relative error
    vmax = 400
    plt.title('Relative Reconstruction Error (capped at %d)' % vmax)
    plt.imshow(rel_error, vmin=0, vmax=vmax, aspect='auto')
    plt.colorbar()
    plt.xlabel('feature index')
    plt.ylabel('observation index')
    plt.show()
    plt.close()


main()

@hamster-on-wheels I can have a look into it later this week. If your input data have outliers, then you should use the robust SPCA function instead of the SPCA function. Maybe that helps to fix the problem.
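If the robust mode is just the robust flag you already pass, the change would be a one-line edit of your call (a sketch, assuming that flag switches to the robust solver):

# Same call as in your script, but with the robust flag enabled
# (assumption: this selects the robust SPCA solver).
B, A, eigen_values, obj = ristretto.pca.compute_spca(
    X, n_components=n_components, alpha=sparsity_alpha, robust=True,
    beta=ridge_beta, tol=1e-5, max_iter=1e4)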

Changing the ±4 shifts to X = np.vstack([X - 10, X + 10]) makes the spikes in the error vanish.

Making X a bimodal distribution with the two centers at ±s means the relative error is roughly delta / s, so this definition of error is meaningless. I have to find a better definition of error.
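One option would be to normalize the residual by the scale of X instead of element by element (a sketch, not anything from ristretto):

import numpy as np

def scaled_error(X, X_recon):
    # Residual in percent of each column's standard deviation, so near-zero
    # entries of X no longer inflate the error.
    scale = np.std(X, axis=0, keepdims=True)
    return 100 * np.abs(X - X_recon) / scale

def frobenius_error(X, X_recon):
    # Single overall figure of merit: relative error in the Frobenius norm.
    return 100 * np.linalg.norm(X - X_recon) / np.linalg.norm(X)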