Research papers I read

[Note: I stopped updating this after Feb 2018 as it got harder and harder to keep track of all the papers I read.]

papers

List of papers I have read, am reading and want to read starting 1st Sept 2017.

Read

  • Asynchronous Methods for Deep Reinforcement Learning - ICML 2016 - [RL]

    aka A3C. Instead of training on samples from replay memory to decorrelate temporally adjacent experiences, run multiple agents in parallel, each in its own copy of the environment and each acting on the current global policy. Training becomes more stable. Beats the previous best in half the training time. Train k agents on a single k-core CPU, so there are no communication costs as with [Gorila](https://arxiv.org/abs/1507.04296). With off-policy learning the individual agents can follow different policies, which makes training more explorative and stable. Replay memory can still be combined with this to increase data efficiency.
  • Unsupervised Domain Adaptation by Backpropagation - 2014 - [CV] [GANs]

    It's a GAN in disguise. You have datasets from 2 domains: 1) labelled synthetic images and 2) unlabelled real images. You want to label the images from the real domain. Idea: there are three NN modules, G, C and D. Network G must learn domain-invariant features. Feed the features to their equivalent of a discriminator (D) and penalize G if D can predict the domain from the given features. Also feed the same features to classifier C and train it to label the synthetic data. Over time D can't tell the domain, the learnt features are domain-invariant, and by the covariate shift assumption the network [G --> C] becomes good at classifying unlabelled real images.
  • Learning to Repeat: Fine Grained Action Repetition for deep reinforcement learning - ICLR 2017 - [RL]

    aka FiGAR. In a policy gradient method, instead of just predicting the next action `a` from a set of actions `A` (continuous or discrete), predict a tuple (`a`, `w`) where `a` comes from `A` and `w` comes from a set of discrete integers `W`. Repeat action `a` for the next `w` time-steps. The intuition is this: in many situations you want to repeat the same action over a long range of time-steps. Decoupling the prediction of `a` from that of `w` prevents the network's output space from blowing up.
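
    The sampling loop described above can be sketched as follows. This is only an illustration, assuming a policy with two independent output heads over discrete `A` and `W = {1, ..., |W|}`; the names `sample_action_and_repeat`, `rollout`, `env_step` and `policy` are hypothetical and not taken from the paper.

```python
import random


def sample_action_and_repeat(action_probs, repeat_probs, rng):
    """Sample an action index a and a repetition count w from two separate heads."""
    actions = list(range(len(action_probs)))
    repeats = list(range(1, len(repeat_probs) + 1))  # assumed W = {1, ..., |W|}
    a = rng.choices(actions, weights=action_probs, k=1)[0]
    w = rng.choices(repeats, weights=repeat_probs, k=1)[0]
    return a, w


def rollout(env_step, policy, horizon, rng):
    """Act for `horizon` steps, repeating each sampled action for w steps."""
    trace = []
    t = 0
    while t < horizon:
        action_probs, repeat_probs = policy(t)
        a, w = sample_action_and_repeat(action_probs, repeat_probs, rng)
        # Repeat the same action, but never run past the horizon.
        for _ in range(min(w, horizon - t)):
            env_step(a)
            trace.append(a)
            t += 1
    return trace
```

    With separate heads the network needs |A| + |W| outputs rather than one output per (`a`, `w`) pair.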
  • Attention Is All You Need by Ashish Vaswani et al. - 2017 - [DL]

    novelty - 10/10. Fixed number of Attend and Analyse steps == number of stacked Transformer units (6 in the paper). Transformer unit: consists of 1) an encoder layer and 2) a decoder layer. Both layers contain a sub-layer for attention and a fully connected sub-layer. The decoder contains an additional masking step that prevents it from seeing the current and future tokens. Multiple smaller attention heads are used instead of a single big attention head. Positional information of both input and output sequences is fused into the embeddings before feeding them to the first Transformer layer. After that the order of input or output tokens doesn't matter until the next Transformer unit. The positional encoding is cleverly designed to support relative indexing for attention.
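
    The "relative indexing" property of the sinusoidal positional encoding can be checked with a small sketch. It assumes an even `d_model`; `shift` is a hypothetical helper (not something from the paper) that shows the encoding at position `pos + k` is a fixed rotation of the encoding at `pos`.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones (even d_model assumed)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe


def shift(pe, k, d_model):
    """Rotate each (sin, cos) pair by k steps using the angle-addition identities."""
    out = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        out.append(s * math.cos(k * freq) + c * math.sin(k * freq))
        out.append(c * math.cos(k * freq) - s * math.sin(k * freq))
    return out
```

    Because the rotation depends only on the offset `k`, a linear layer can in principle learn to attend by relative position.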

  • Residual Algorithms: Reinforcement Learning with Function Approximation by L. Baird - 1995 - [RL]

    TD(0) updates are guaranteed to converge for table lookup but not for function approximators. Enter Residual Gradient updates: define a loss function E over the Bellman residual (RHS-LHS of the Bellman eq.). Do gradient descent on E --> guaranteed to converge but slow. Slow because the updates go both ways (next_state_action <--> this_state_action). Enter Residual (delta_w_r) updates: strike a compromise b/w TD(0) (delta_w_d) and Residual Gradient (delta_w_rg).

    TD(0) update

    Residual Gradient update

    The dotted line is the hyperplane perpendicular to the true gradient w.r.t. the residual (need to stay left of it for robustness). Mustn't go far from the TD(0) update (the direction of fast learning). Idea: project the TD(0) update onto the dotted line, then nudge it slightly to the left.
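
    The three updates can be sketched for a linear value function V(x) = w . features(x). This is an illustration under that assumption, not the paper's code; the function names are hypothetical, and `phi` is treated here as a free mixing weight between the two extremes rather than being chosen by the projection described above.

```python
def value(w, feats):
    """Linear value estimate: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def bellman_residual(w, feats, next_feats, reward, gamma):
    """RHS - LHS of the Bellman equation for one transition."""
    return reward + gamma * value(w, next_feats) - value(w, feats)


def direct_update(w, feats, next_feats, reward, gamma, alpha):
    """TD(0) step (delta_w_d): only the current state's value is moved."""
    delta = bellman_residual(w, feats, next_feats, reward, gamma)
    return [alpha * delta * f for f in feats]


def residual_gradient_update(w, feats, next_feats, reward, gamma, alpha):
    """Gradient descent on 0.5 * residual^2 (delta_w_rg): both states are moved."""
    delta = bellman_residual(w, feats, next_feats, reward, gamma)
    return [-alpha * delta * (gamma * fn - f) for f, fn in zip(feats, next_feats)]


def residual_update(w, feats, next_feats, reward, gamma, alpha, phi):
    """Weighted mix (delta_w_r): phi=0 gives TD(0), phi=1 gives residual gradient."""
    d = direct_update(w, feats, next_feats, reward, gamma, alpha)
    rg = residual_gradient_update(w, feats, next_feats, reward, gamma, alpha)
    return [(1.0 - phi) * di + phi * ri for di, ri in zip(d, rg)]
```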

  • Efficient per-example gradient computations by Goodfellow - 2015 - [DL]

    How to calculate the norm of the gradient of each example in a batch? Naive: have N batches of size 1. Better approach: calculate the gradient of the loss (which is the sum of errors on all examples in the batch) w.r.t. the intermediate activations Z of all examples in the batch, then use this gradient Z-bar to compute the per-layer, per-example gradient norm.
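
    For a single fully connected layer Z = H W, the per-example weight gradient is the outer product of the layer input h_i and the activation gradient zbar_i, so its squared norm factorises into a product of two squared norms. A small sketch of that identity, with hypothetical function names and plain lists standing in for batched tensors:

```python
def per_example_grad_sq_norms(H, Zbar):
    """Squared gradient norm per example: ||h_i||^2 * ||zbar_i||^2."""
    return [
        sum(h * h for h in h_i) * sum(z * z for z in z_i)
        for h_i, z_i in zip(H, Zbar)
    ]


def naive_per_example_grad_sq_norms(H, Zbar):
    """Reference version: build each example's weight gradient explicitly."""
    norms = []
    for h_i, z_i in zip(H, Zbar):
        grad = [[h * z for z in z_i] for h in h_i]  # outer product of h_i and zbar_i
        norms.append(sum(g * g for row in grad for g in row))
    return norms
```

    The first version never materialises a per-example weight gradient, which is where the savings over N batches of size 1 come from.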
  • Differential training of Rollout policies by Bertsekas - 1997 - [RL]

    Instead of approximating Q(s,a) or V(s), which are prone to noise in the environment and training (two-way flow of information), approximate G(s,s') = V(s) - V(s'), which tells how good state s is relative to s'. Interestingly, standard RL methods can still be applied to approximate G. The states for this problem are (s,s') pairs and the reward is (r - r').
  • Learning from Simulated and Unsupervised Images through Adversarial Training - CVPR 2017 - [GANs]

    CVPR best paper award. Motivation: the need for more annotated training data. The idea is to generate realistic images with class annotations from computer generated simulations. The generator G takes as input a computer generated simulation with a class label (like apple or orange) and makes changes to it so that it looks realistic. The discriminator D must learn to discriminate the real images from the seemingly real ones generated by G. What if G takes a simulated image of an orange and changes it so much that it now looks like an apple? We can't let this happen, otherwise we would need somebody to re-annotate the generated images (which defeats the whole purpose of automatically generating the annotated data). To prevent this, both G and D are restricted to small regions of the image. This way G will never be able to make strong global changes, so class labels are preserved.
  • Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio - 2010 - [DL]

    Pre-batch normalization era: how factors such as initialization and non-linearities affect training with SGD. Good initialization, as shown by unsupervised pre-training (training each layer and its transpose to be an autoencoder), plays an important role in quick training. The activation functions should be zero-mean. The best non-linearity is a cousin of tanh --> softsign (x/(1+|x|)). The best initializations are zero-mean with a variance scaled by the layer widths, so that activations and gradients keep roughly the same variance across layers.
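
    The softsign non-linearity and the normalized ("Glorot" or "Xavier") uniform initialization proposed in the paper can be sketched as below. The function names and the list-of-lists weight layout are assumptions for illustration.

```python
import math
import random


def softsign(x):
    """Softsign: like tanh, but approaches its asymptotes polynomially."""
    return x / (1.0 + abs(x))


def glorot_uniform(fan_in, fan_out, rng):
    """Zero-mean uniform weights with variance 2 / (fan_in + fan_out)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]
```

    A uniform distribution on [-limit, limit] has variance limit^2 / 3, which is where the factor 6 comes from.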
  • Human-level control through deep reinforcement learning by Mnih et al. - Nature 2015 - [RL]

    Two extremely simple ideas. 1) Use experience replay: the order in which you provide observations (s, a, r, s') matters. If you provide them as they come, Q-learning becomes unstable for function approximators because of the correlations b/w subsequent observations. Store observations in a buffer and provide them at random. 2) Use two (instead of one) Q networks. Freeze one and use it as the base for evaluating the next-state value while improving the second one. After C steps, set the weights of the frozen network to be exactly the same as the improved network, freeze it again, and repeat.
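
    Both ideas fit in a few lines. The sketch below is an illustration only: `ReplayBuffer`, `td_target` and `maybe_sync_target` are hypothetical names, and network weights are represented as a plain list rather than a real model.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions, sampled at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)


def td_target(reward, gamma, frozen_next_q_values):
    """Bootstrap target computed from the frozen network's next-state Q-values."""
    return reward + gamma * max(frozen_next_q_values)


def maybe_sync_target(step, sync_every, online_weights, target_weights):
    """Every `sync_every` steps (C in the summary), copy online weights into the frozen net."""
    if step % sync_every == 0:
        return list(online_weights)
    return target_weights
```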
  • A Distributional perspective on Reinforcement Learning by Bellemare et al. - 2017 - [RL]

    Instead of modelling the expected return, model a distribution over possible return values. This stabilises training and can model intrinsic stochasticity in the environment and in the behaviour of the agent. Define equivalents of the Bellman Operator and the Bellman Optimality Operator in the distributional sense. They prove the Evaluation setting to be a contraction w.r.t. a particular metric, the Wasserstein metric. The Control setting, however, is not a contraction in any known metric. It remains to be seen whether this presents a practical problem or not.
  • VAE: Auto-encoding variational bayes by Kingma et al. - 2014 - [Bayesian] [Unsupervised]

    Understood it through this [Tutorial](https://arxiv.org/pdf/1606.05908.pdf) and this [blog](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder). I am yet to fully grasp this from a theoretical side, but from a deep learning side I think I understood it. This paper's main contribution to the AutoEncoder framework, in my opinion, is that they perturb the latent embeddings and make sure that the Decoder is still able to reconstruct the input. But the main flaw is that the loss they use is computed pixel to pixel (or dimension to dimension) with complete disregard for inter-pixel or inter-dimensional dependencies. I think this is the primary reason why the generated and reconstructed images are fuzzy. Other recent papers like PixelVAE solve this problem in the image domain by using a PixelCNN autoregressive decoder with a 5x5 kernel.
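
    The pieces mentioned above can be sketched as follows, assuming a diagonal Gaussian posterior and a standard normal prior: the latent perturbation (the reparameterization trick), an element-wise reconstruction loss, and the KL term that keeps the posterior close to the prior. Function names are hypothetical, and squared error is just one common choice of per-pixel loss.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Perturbed latent code: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def pixelwise_squared_error(x, x_hat):
    """Each pixel is scored independently; dependencies between pixels are ignored."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
```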
  • PixelVAE: A Latent Variable Model for Natural Images - By Gulrajani et al. - 2017 - [DL] [VAE]

    The major contribution to the VAE architecture is that they use a PixelCNN decoder trained with teacher forcing. This frees the latent embedding from having to memorise fine details in images. How do they guarantee that semantic information flows through the latent space while only the style information flows through the PixelCNN? They use a 5x5 kernel, from which it is impossible to get the big picture (pun, got it?). They are able to generate sharp images this way.
  • Skip-Thought Vectors - NIPS 2015 - [NLP]

    Aim: construct semantic embeddings for sentences. Idea: given a sentence in running text, try to predict the previous sentence and the next sentence. Teacher force while predicting. If the domain contains a huge number of unique words, map them to the latent space of word2vec and then take the nearest neighbour in the small set of words that we want to consider. Test on downstream tasks, possibly adding just one linear layer to adapt the sentence embeddings to the task.
  • Understanding Deep Learning Requires Rethinking Generalization - By C Zhang et al - 2016 - [DL]

    Shows that a sufficiently large network (with just 2*n+d parameters) can overfit a completely random dataset of n d-dimensional points. So neural networks generalize well beyond the training dataset even though they have the capacity to memorize it. Overfitting does take more time to converge though. Maybe the reason NNs generalize so well is that generalizing solutions are somehow easier to reach.
  • Dynamic Routing Between Capsules - By S Sabour et al - 2017 - [DL] [CV]

    The building blocks of the NN are vectorized capsules as opposed to scalar neurons. The network is formed by layers of capsules. The output of each capsule is a squashed vector with a max length of 1. Each capsule in the lower layer (a capsule for detecting a nose, for example) distributes its output to all capsules in the next layer (a capsule for detecting the face). The distributed outputs are weighted according to a routing matrix C. The distributed outputs undergo an affine transformation (how is the existence and pose of the nose related to the existence and pose of the face) by the W matrix of the higher layer. These affine transformations from each of the lower capsules to a higher capsule are then summed together to form the resultant vector for the higher level capsule. The routing matrix C is calculated from the agreement between the affine transformations from the lower layer and the resultant vector. But this is a chicken and egg problem, since we don't have the resultant vector without C. Therefore, the matrix C is iteratively (iter=3) calculated from scratch in every forward pass using the agreement (dot-product) b/w the supplied output from a particular lower level capsule and the resultant vector.

    I found the idea pretty interesting but I wish there was a more elegant way of calculating the routing matrix. The ad-hoc way of calculating the routing matrix leaves instability in training a likely possibility.
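
    The iterative computation of C can be sketched as below. This is a simplified illustration with plain lists: `u_hat[i][j]` is assumed to already hold the transformed output of lower capsule `i` intended for higher capsule `j`, and the names `squash`, `softmax` and `route` are chosen here for readability.

```python
import math


def squash(s):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0 for _ in s]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, iterations=3):
    """Routing by agreement; returns higher-capsule outputs v and the coupling matrix c."""
    n_lower, n_higher, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_higher for _ in range(n_lower)]  # routing logits start at zero
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # each lower capsule splits its output
        v = []
        for j in range(n_higher):
            s_j = [
                sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                for d in range(dim)
            ]
            v.append(squash(s_j))
        # Agreement (dot product) raises the logit for the next iteration.
        for i in range(n_lower):
            for j in range(n_higher):
                b[i][j] += sum(u * x for u, x in zip(u_hat[i][j], v[j]))
    return v, c
```

    In the usage below, two lower capsules agree on higher capsule 0 and disagree on capsule 1, so routing shifts coupling towards capsule 0.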

  • Neural Discrete Representation Learning - By A Oord et al - 2017 - [UL]

    How do you train an auto encoder with an autoregressive decoder? How do you ensure that the latent representations learn a global aspect of the input and not some style characteristic of the input? After all, you are just minimising the MSE reconstruction loss. The model is free to choose what information it channels through the latent representation and what information it channels through the autoregressive mechanism. One solution to this problem is making the latent space K-way categorical for a small and finite K like K=512. VQ-VAE: just like an ordinary VAE, except that the latent space Z has K special vectors e1, e2, e3...eK. The Encoder computes a continuous z. The special vector e_i nearest to z is passed on to the Decoder. e_i is artificially given the gradients of z. But how are these special vectors selected? The special vectors are randomly initialised and then updated at every iteration to minimize the l2 loss between any given z and the special vectors. The special vectors play catch-up. What if the z vectors rush outwards too fast for the special vectors to catch up? Don't worry, we've got an l2 loss for that too.
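
    The nearest-vector lookup and the two l2 terms can be sketched as follows. Without an autodiff framework the two losses are numerically identical up to a weight; the comments note which side each one is meant to update. The names and the default weight `beta=0.25` are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z, codebook):
    """Return the index and value of the special vector e_i nearest to z."""
    index = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
    return index, codebook[index]


def vq_losses(z, e, beta=0.25):
    """The two l2 terms around the quantization step."""
    dist = squared_distance(z, e)
    codebook_loss = dist           # meant to move e towards z (z held constant)
    commitment_loss = beta * dist  # meant to keep z near e (e held constant)
    return codebook_loss, commitment_loss
```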
  • Generalizing Across Domains via Cross-Gradient Training - By Shankar et al. - ICLR 2018 - [Domain Adaptation] [UL]

    [Do not understand some parts, will come back to it later] Awesome paper in my opinion. Assume you have a lot of training data in one domain and a little data for a few other domains. How do you train a Neural Net which generalizes to data from a huge number of unseen domains? How can we leverage sparse data from a few domains using a lot of data in one domain? Train two neural networks. The first is standard: given a sample, it predicts the class label. The second NN helps in augmenting the data from sparse domains. How? The second NN is trained to predict the **domain** of the input. Augmentation is performed by perturbing the input so as to increase the loss of the second NN. Use the augmented input for training the first network. Interestingly, since the perturbations happen in a real-valued space, the augmented input might not even belong to any of the few domains. It could be a mutant of the domains under consideration. There is one subtle challenge, though, that the reader is quite likely to skim over: the perturbations must be such that they only disturb the domain of the input and not its label. I do not fully understand this yet. Will get back to this when I have more time.
  • FiLM: Visual Reasoning with a General Conditioning Layer - By Perez et al. - Dec 2017 - [Information mixing] [DL]

    A neat way to mix information from two or more diverse parts of a data point X. Normally you'd think that concatenating the latent representations of the component parts of a data point is enough to express the data point. True, but if you want to process the information you need to mix them thoroughly. In this paper, they propose simple affine transformations at each layer of the processing network. Interestingly, these affine transformations pass enough information from the representation of a natural language question into the neural network pipeline of the corresponding image to produce the right answer to the question.
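
    The affine transformation is a per-feature scale and shift whose parameters are computed from the question. A minimal sketch, assuming a single linear map from the question embedding to the parameters; `film_params`, `W_gamma` and `W_beta` are hypothetical names and not the paper's architecture.

```python
def linear(vec, W):
    """Multiply a vector by a matrix given as a list of rows (one row per output)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in W]


def film_params(question_embedding, W_gamma, W_beta):
    """Predict one scale (gamma) and one shift (beta) per image feature."""
    return linear(question_embedding, W_gamma), linear(question_embedding, W_beta)


def film(features, gamma, beta):
    """Feature-wise affine modulation: gamma * x + beta for each feature x."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]
```

    Setting a gamma entry to zero switches the corresponding image feature off for that question, which is one way the question can steer the visual pipeline.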

Reading

Want to Read