/EigenPro-matlab

EigenPro iteration in MATLAB

Primary LanguageMATLABMIT LicenseMIT

EigenPro

Intro

EigenPro is a preconditioned (stochastic) gradient descent iteration proposed in the paper:

Ma, Siyuan, and Mikhail Belkin. Diving into the shallows:
a computational perspective on large-scale shallow learning.
In NIPS, 2017.

It accelerates the convergence of SGD iteration when minimizing linear and kernel least squares, defined as

TODO: Update README

where is the labeled training data. Let .

Consdier the linear setting where is a vector space and . The corresponding standard gradient descent iteration is hence,

where is the covariance matrix and . The step size is automatically set as to ensure fast convergence. Note that the top eigenvalue of the covariance is calculated approximately. We then construct EigenPro preconditioner P using the approximate top eigensystem of H, which can be efficiently calculated when H has fast eigendecay.

Here we select to counter the negative impact of eigensystem approximation error on convergence. The EigenPro iteration then runs as follows,

With larger , EigenPro iteration yields higher convergence acceleration over standard (stochastic) gradient descent. This is especially critical in the kernel setting where (widely used) smooth kernels have exponential eigendecay. Note that in such setting is typically an RKHS (reproducing kernel Hilbert space) of infinite dimension. Thus it is necessary to parametrize the (approximate) solution in a subspace of finite dimension (e.g. ). See the paper for more details on the kernel setting and some theoretical results.

Using the code

Preprocessed MNIST data

The data is preprocessed by mapping the feature into [0, 1].

cd data
unzip mnist.zip

Setting CPU/GPU flag

User can change the first line in 'src/example_usage.m' to switch between GPU and CPU. When set

use_gpu = true;

the script will use MATLAB gpuarray to store data and weights.

Selecting the kernel

User can select the kernel function by adding the param 'kernel' as a key-value pair when declaring a model.

The current options are 'Gaussian', 'Laplace', and 'Cauchy'.

Running experiments (in MATLAB console)

The experiments will compare Pegasos, Kernel EigenPro, Random Fourier Feature with linear SGD, and Random Fourier Feature with EigenPro on MNIST.

run('run_expr.m')

Training with SGD/EigenPro iteration

SGD iteration

The following function call will update 'initial_weights' to 'new_weights' using 'n_epoch' SGD epochs.

[new_weights, time] = ...
    sgd_iterate(random_seed, train_x, train_y, initial_weights,
                phi, eta, batch_size, n_epoch, method_name);

Note that 'phi' is a given feature function that maps the original data features to kernel features or random Fourier features. Besides, there are two methods available now: 'Pegasos' and 'Linear'.

EigenPro iteration

EigenPro iteration has interface similar to that of SGD iteration.

[new_weights, time] = ... eigenpro_iterate(random_seed, train_x, train_y, initial_weights, phi, eta, batch_size, n_epoch, method_name, k, M, tau);

Noticeably, it has extra parameters specified for EigenPro.
Here we will compute the top-'k' eigensystem of
a subsample covariance involving 'M' data samples
to form the EigenPro preconditioner.
The available methods are 'Kernel EigenPro' and 'EigenPro'.



## Reference experimental results

### Classification Error (MNIST)
In these experiments, EigenPro (Primal) achieves classification error 1.23%, after only 10 epochs. In comparison, Pegasos takes 160 epochs to reach the same error. Although the number of random features used by EigenPro (Random) and RF/DSGD is 6 * 10^4, same as the number of training points, methods using random features deliver slighly worse performance. Specifically, RF/DSGD has error rate 1.71% after 40 epochs and Pegasos reaches error rate 1.65% after the same number of epochs.

<table>
  <tr>
    <th rowspan="2">#Epochs</th>
    <th colspan="4">Primal</th>
    <th colspan="4">Random Fourier Feature</th>
  </tr>
  <tr>
    <td align="center" colspan="2">EigenPro</td>
    <td align="center" colspan="2">Pegasos</td>
    <td align="center" colspan="2">EigenPro</td>
    <td align="center" colspan="2">RF/DSGD</td>
  </tr>
  <tr>
    <td></td>
    <td align="center">train</td>
    <td align="center">test</td>
    <td align="center">train</td>
    <td align="center">test</td>
    <td align="center">train</td>
    <td align="center">test</td>
    <td align="center">train</td>
    <td align="center">test</td>
  </tr>
  <tr>
    <td>1</td>
    <td>0.92%</td>
    <td>2.03%</td>
    <td>5.12%</td>
    <td>5.21%</td>
    <td>0.80%</td>
    <td>1.93%</td>
    <td>5.21%</td>
    <td>5.33%</td>
  </tr>
  <tr>
    <td>5</td>
    <td>0.10%</td>
    <td>1.44%</td>
    <td>2.36%</td>
    <td>2.84%</td>
    <td>0.12%</td>
    <td>1.49%</td>
    <td>2.48%</td>
    <td>2.98%</td>
  </tr>
  <tr>
    <td>10</td>
    <td>0.01%</td>
    <td>1.23%</td>
    <td>1.58%</td>
    <td>2.32%</td>
    <td>0.03%</td>
    <td><b>1.44%</b></td>
    <td>1.66%</td>
    <td>2.37%</td>
  </tr>
  <tr>
    <td>20</td>
    <td>0.0%</td>
    <td><b>1.20%</b></td>
    <td>0.90%</td>
    <td>1.93%</td>
    <td>0.01%</td>
    <td>1.45%</td>
    <td>0.98%</td>
    <td>2.03%</td>
  </tr>
  <tr>
    <td>40</td>
    <td>0.0%</td>
    <td>1.20%</td>
    <td>0.39%</td>
    <td>1.65%</td>
    <td>0.0%</td>
    <td>1.46%</td>
    <td>0.49%</td>
    <td>1.71%</td>
  </tr>
</table>


### Training Time per Epoch

<table>
  <tr>
    <th rowspan="2">Computing<br>Resource</th>
    <th colspan="2">Primal</th>
    <th colspan="2">Random Fourier Feature</th>
  </tr>
  <tr>
    <td align="center">EigenPro</td>
    <td align="center">Pegasos</td>
    <td align="center">EigenPro</td>
    <td align="center">RF/DSGD</td>
  </tr>
  <tr>
    <td>One GTX Titan X (Maxwell)</td>
    <td align="center">4.8s</td>
    <td align="center">4.6s</td>
    <td align="center">2.2s</td>
    <td align="center">2.0s</td>
  </tr>
  <tr>
    <td>One GTX Titan Xp (Pascal)</td>
    <td align="center">2.6s</td>
    <td align="center">2.3s</td>
    <td align="center">1.1s</td>
    <td align="center">1.0s</td>
  </tr>
  <tr>
    <td>Two Xeon E5-2620</td>
    <td align="center">72s</td>
    <td align="center">70s</td>
    <td align="center">78s</td>
    <td align="center">72s</td>
  </tr>
</table>

### EigenPro Preprocessing Time
In our experiments we construct the EigenPro preconditioner by computing the top 160 approximate eigenvectors for a subsample matrix with 4800 points using Randomized SVD (RSVD).

<table>
  <tr>
    <th>Computing<br>Resource</th>
    <th>RSVD Time<br>(k = 160, m = 4800)</th>
  </tr>
  <tr>
    <td>One GTX Titan X (Maxwell)</td>
    <td align="center">7.6s</td>
  </tr>
  <tr>
    <td>One GTX Titan Xp (Pascal)</td>
    <td align="center">6.3s</td>
  </tr>
  <tr>
    <td>Two Xeon E5-2620</td>
    <td align="center">17.7s</td>
  </tr>
</table>