PyTorch's autograd only provides methods to compute the gradient of scalars, but the Jacobian - the per-sample gradient of a batch of outputs - is often needed in research. Concretely, for $N$ inputs $x^{(n)}$ the network produces $N$ outputs $z^{(n)}$, and one wants $\partial z^{(n)} / \partial \theta$ for every parameter $\theta$ and every sample $n$.
There is already a package, "backpack", aimed at this problem, but its BatchGrad quantity - essentially this Jacobian - only supports nn.Linear and nn.Conv2d layers. See the backpack website https://backpack.pt/ and paper https://openreview.net/pdf?id=BJlrF24twB.
Based on the main idea of backpack, this repository provides a more general interface for fast Jacobian calculation in PyTorch networks. In example.py, it runs more than 100 times faster than a naive for-loop when the batch size is 1000.
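For reference, the naive baseline means one backward pass per sample, roughly like the sketch below (a minimal illustration with a made-up network, not the code from example.py):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
x = torch.randn(1000, 10)                # batch of N = 1000 inputs
params = tuple(net.parameters())

# Naive per-sample Jacobian: loop over the batch, one backward pass per sample.
rows = []
for n in range(x.shape[0]):
    z_n = net(x[n:n + 1]).sum()          # scalar output for sample n
    grads = torch.autograd.grad(z_n, params)
    rows.append(torch.cat([g.flatten() for g in grads]))
jac_naive = torch.stack(rows)            # shape (N, D), D = total number of parameters
```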
This package works for networks satisfying the following conditions.
- All layers with parameters in the network should be 'quasi-linear'. 'Quasi-linear' includes linear layers and convolutional layers in any number of dimensions. Concretely, the relation between weight $W_i$, bias $b_i$, input $x_j^{(n)}$ and output $y_k^{(n)}$ must be expressible as $y_k^{(n)} = \sum_{ij} M_{kij} W_i x_j^{(n)} + \sum_i N_{ki} b_i$, where $M$ and $N$ are tensors whose elements are all 0 or 1 (see the worked example after this list). If one creates a layer like $y = W^2 x$, this package will fail.
- Parameters in a module must be reachable as module.weight or module.bias. This is the default behavior of PyTorch layers such as nn.Linear and nn.Conv2d. If you define a module yourself, please store your parameters accordingly.
- All layers with parameters should be immediate children modules of the network. For example,
```python
import torch.nn as nn

# This works
net = nn.Sequential(nn.Conv2d(1, 1, 2), nn.ReLU(), nn.Conv2d(1, 1, 2), nn.ReLU())

# This doesn't work: the Conv2d layers are nested inside `combine`,
# so they are not immediate children of `net`
combine = nn.Sequential(nn.Conv2d(1, 1, 2), nn.ReLU())
net = nn.Sequential(combine, combine)
```
- Apart from the batch dimension, all other input dimensions should be fixed, or at least only a few distinct input shapes should occur. For example, the input x can be of shape (1000, 1, 10, 10) as well as (10, 1, 10, 10); mixing (100, 1, 10, 10) and (100, 1, 9, 9) is not ideal (it takes more time but still works), and (100, 1, ?, ?) where ? can be any integer will not work.
- This method is still very memory-consuming, so it can only be used on fairly small networks.
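To see what the quasi-linear condition means in practice, take nn.Linear as an example: it computes $y_k^{(n)} = \sum_j W_{kj} x_j^{(n)} + b_k$. Writing the flattened weight index as $i = (a, c)$, this matches the required form with
$ M_{k,(a,c),j} = \delta_{ka} \delta_{cj}, \qquad N_{ki} = \delta_{ki} $
both of which contain only 0s and 1s, so nn.Linear is quasi-linear; a convolution works the same way, with $M$ encoding the sliding-window pattern.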
See example.py for usage. The provided jacobian method has an argument as_tuple, which defaults to False. If False, the output is an N by D tensor, where N and D are the batch size and the total number of parameters, respectively. If True, the output is a tuple arranged in the same order as net.parameters(); every element of the tuple is a tensor of shape (N, shape of the corresponding parameter).
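The two formats carry the same information; here is a minimal sketch of how they relate, with made-up parameter shapes standing in for the real output of jacobian:

```python
import torch

N = 8  # batch size
# Hypothetical as_tuple=True output for a net with one (3, 4) weight and one (3,) bias:
# one tensor of shape (N, *param.shape) per parameter, in net.parameters() order.
jac_tuple = (torch.randn(N, 3, 4), torch.randn(N, 3))

# The as_tuple=False output is the same data with each block flattened and
# concatenated along dim 1, giving an (N, D) matrix.
jac_flat = torch.cat([j.reshape(N, -1) for j in jac_tuple], dim=1)
print(jac_flat.shape)  # torch.Size([8, 15]), D = 3*4 + 3 = 15
```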
Assume the network gives a batch of scalar outputs $z^{(n)}$, one per sample $n$. For a layer with weight $W_i$ and output $y_k^{(n)}$, the chain rule gives
$ \frac{\partial z^{(n)}}{\partial W_i} = \sum_k \frac{\partial z^{(n)}}{\partial y_k^{(n)}} \frac{\partial y_k^{(n)}}{\partial W_i} $
The first part is easy to obtain. Define $ z = \sum_n z^{(n)} $; then
$ \frac{\partial z}{\partial y_k^{(n)}} = \frac{\partial z^{(n)}}{\partial y_k^{(n)}} $
because sample $n$'s output depends only on sample $n$'s input. A single backward pass on $z$ with a suitable backward hook therefore yields $\frac{\partial z^{(n)}}{\partial y_k^{(n)}}$ for all samples at once.
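A minimal sketch of this step (illustrative only, with a made-up network; the repository's own hook bookkeeping may differ):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
x = torch.randn(8, 4)

# Because z = sum_n z^(n) and sample n's output only depends on sample n's input,
# row n of grad_output at any layer equals dz^(n)/dy^(n).
output_grads = {}

def make_hook(name):
    def hook(module, grad_input, grad_output):
        output_grads[name] = grad_output[0].detach()
    return hook

for name, module in net.named_children():
    if next(module.parameters(), None) is not None:   # only layers with parameters
        module.register_full_backward_hook(make_hook(name))

z = net(x).sum()   # sum over the batch
z.backward()       # a single backward pass fills output_grads
print({k: v.shape for k, v in output_grads.items()})
# {'0': torch.Size([8, 3]), '2': torch.Size([8, 1])}
```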
The second part is more complicated, and this is where the 'quasi-linear' assumption comes in. Differentiating $y_k^{(n)} = \sum_{ij} M_{kij} W_i x_j^{(n)} + \sum_i N_{ki} b_i$ with respect to $W_i$ gives $\frac{\partial y_k^{(n)}}{\partial W_i} = \sum_j M_{kij} x_j^{(n)}$, so the tensors $M$ and $N$, together with the layer inputs $x^{(n)}$ recorded during the forward pass, are all that is needed. The bias part goes similarly: $\frac{\partial y_k^{(n)}}{\partial b_i} = N_{ki}$.
After going through all these preparations, the Jacobian can be expressed as
$ \frac{\partial z^{(n)}}{\partial W_i} = \sum_k \frac{\partial z}{\partial y_k^{(n)}} \sum_j M_{kij} x_j^{(n)}$
$ \frac{\partial z^{(n)}}{\partial b_i} = \sum_k \frac{\partial z}{\partial y_k^{(n)}} N_{ki}$
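As a sanity check on these formulas, here is a self-contained snippet for a single nn.Linear layer, where the general expressions reduce to a per-sample outer product; the layer, shapes, and the quadratic output are made up for the demo:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)
x = torch.randn(8, 4)

# Record the layer input with a forward hook and dz/dy with a backward hook.
saved = {}
layer.register_forward_hook(lambda m, inp, out: saved.update(inp=inp[0].detach()))
layer.register_full_backward_hook(lambda m, gi, go: saved.update(gout=go[0].detach()))

z = (layer(x) ** 2).sum()   # some scalar built from the batch of outputs
z.backward()

# For a linear layer the formulas above become, per sample n:
#   dz^(n)/dW = gout^(n) (x^(n))^T    and    dz^(n)/db = gout^(n)
jac_W = torch.einsum('nk,nj->nkj', saved['gout'], saved['inp'])   # shape (N, 3, 4)
jac_b = saved['gout']                                             # shape (N, 3)

# Compare against a naive per-sample autograd loop.
for n in range(x.shape[0]):
    gW, gb = torch.autograd.grad((layer(x[n]) ** 2).sum(), (layer.weight, layer.bias))
    assert torch.allclose(jac_W[n], gW, atol=1e-6)
    assert torch.allclose(jac_b[n], gb, atol=1e-6)
print("per-sample gradients match the hook-based formulas")
```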