Good-Papers

I try my best to keep up with cutting-edge knowledge in Machine Learning/Deep Learning and Natural Language Processing. These are my notes on some good papers.

Some good papers I like

Basic Background

  • Gradients in matrix algebra (TODO list)
  • Gaussian CheatSheet
  • Kronecker-Factored Approximations (TODO list)
  • Asynchronous stochastic gradient descent

Topics with detailed notes

  • Exponentiated Gradient (EG)
  • Expectation Propagation
  • Gaussian processes
  • Gaussian Processes with Notes

Papers with detailed notes

Papers with quick notes

  • Neural Architectures for Named Entity Recognition (https://arxiv.org/pdf/1603.01360.pdf): A good paper showing how to build a state-of-the-art NER system without using dictionaries (i.e. gazetteers) or any external labeled data (the only sources are the training data and unlabeled data for training word embeddings). The key lies in a simple biLSTM that learns feature representations of words automatically, with a CRF on top of the LSTM to score the output. Putting a CRF on top of the LSTM makes it possible to impose several hard constraints (e.g., I-PER cannot follow B-LOC), which is very hard (or at least not obvious) to do with a neural network alone, I think. It should also be noted that the in-context embedding of a word is computed with a character-level bidirectional LSTM, which turns out to be very helpful for NER. A sketch of the constrained decoding idea is below.
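
A minimal sketch of the constrained-decoding idea (my own toy example, not the authors' code): hard tag-transition constraints such as "I-PER cannot follow B-LOC" become -inf entries in a CRF transition matrix, and Viterbi decodes over the biLSTM emission scores.

```python
import numpy as np

tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
T = np.zeros((len(tags), len(tags)))                    # transition scores
T[tags.index("B-LOC"), tags.index("I-PER")] = -np.inf   # forbid B-LOC -> I-PER

def viterbi(emissions, T):
    """emissions: (seq_len, n_tags) scores from the biLSTM."""
    n, k = emissions.shape
    score, back = emissions[0].copy(), np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + T + emissions[t][None, :]
        back[t], score = cand.argmax(axis=0), cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(np.random.randn(4, len(tags)), T))
```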

  • Grammar as a Foreign Language (https://arxiv.org/pdf/1412.7449.pdf): This paper shows how to apply sequence-to-sequence models with attention to syntactic constituency parsing. To do that, they first linearize the parse tree, which can be done by following a depth-first traversal order. The results are very competitive for such a simple model. A toy linearizer is sketched below.
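
A toy linearizer (my own example; the paper's exact bracketing convention differs slightly): a depth-first traversal turns the tree into a flat symbol sequence that a seq2seq model can emit token by token.

```python
def linearize(tree):
    if isinstance(tree, str):              # terminal (e.g. a POS tag)
        return [tree]
    label, children = tree
    out = ["(" + label]
    for child in children:
        out += linearize(child)
    return out + [label + ")"]

tree = ("S", [("NP", ["NNP"]), ("VP", ["VBZ", ("NP", ["DT", "NN"])])])
print(" ".join(linearize(tree)))
# (S (NP NNP NP) (VP VBZ (NP DT NN NP) VP) S)
```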

  • Enriching Word Vectors with Subword Information (https://arxiv.org/pdf/1607.04606.pdf): An interesting paper which shows that having a distinct vector representation for each word ignores the internal structure of words, an important limitation for morphologically rich languages such as Turkish or Finnish. They propose a subword model for word embeddings, in which a word is represented by its character n-grams plus the word itself. For instance, consider the word where: it is represented by the 3-grams <wh, whe, her, ere, re> (with boundary symbols < and >) plus the special sequence <where>. Given the set G of n-grams of a word, with each n-gram g having its own vector representation z_g, the scoring function between the word and its context c is s(w, c) = \sum_{g \in G} z_g \cdot v_c. This simple model, albeit slightly ad hoc, shares representations across words, thus allowing reliable representations to be learned for rare words. A small sketch follows.
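
A small sketch of the subword representation (toy vectors; z and v_c are hypothetical stand-ins for trained embeddings):

```python
import numpy as np

def char_ngrams(word, n=3):
    padded = "<" + word + ">"              # boundary symbols, as in the paper
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]                # n-grams plus the word itself

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']

rng = np.random.default_rng(0)
z = {g: rng.normal(size=50) for g in char_ngrams("where")}  # n-gram vectors
v_c = rng.normal(size=50)                                   # context vector
score = sum(z[g] @ v_c for g in char_ngrams("where"))       # s(w, c)
```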

  • Compositional Learning of Embeddings for Relation Paths in Knowledge Bases and Text (https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/acl2016relationpaths-1.pdf): Modeling relation paths has been shown to offer significant gains in embedding models for knowledge base completion. The key difficulty lies in enumerating all possible paths. Previous work used approximations, such as random walks or keeping only paths whose score exceeds a certain threshold. This paper shows that we can actually use dynamic programming to efficiently incorporate all relation paths. The technique, however, is somewhat complicated to understand, so I just leave a note here as a reminder that enumerating all possible paths can indeed be done efficiently. I need further reading to understand the method better, though.

  • Traversing Knowledge Graphs in Vector Space (https://arxiv.org/pdf/1506.01094.pdf): A knowledge graph (such as Freebase) consists of a set of entities and their binary relations (edges). The paper focuses on composing these relations to form path queries in the knowledge graph. Path queries can be used to pose more interesting compositional questions like "Where are Tad Lincoln's parents located?", where the graph already contains the entities Tad Lincoln and Lincoln, the dad relation, and the Lincoln located-in relation. The paper proposes a simple margin-based objective to train a knowledge graph completion model: J(\theta) = \sum_i \sum_{t' \in N(t_i)} [1 - (score(s_i \to t_i) - score(s_i \to t'))]_+, where s_i and t_i are the source and target entities of a path, t' ranges over corrupted "neighbor" targets of t_i, [a]_+ = max(0, a), and 1 is the margin. Doing so surprisingly improves state-of-the-art performance in knowledge graph completion. Training a model with such an objective needs a curriculum strategy: they first train on path queries of length 1, and then keep training with longer path queries. A small sketch of one term of this loss is below.
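
A minimal sketch of one term of the margin objective as I read it (my own toy numbers):

```python
import numpy as np

def hinge(a):                                   # [a]_+ = max(0, a)
    return np.maximum(0.0, a)

def path_query_loss(score_pos, scores_neg, margin=1.0):
    # Push the true target above each corrupted target by the margin.
    return hinge(margin - (score_pos - np.asarray(scores_neg))).sum()

print(path_query_loss(2.0, [1.5, 2.3]))  # 0.5 + 1.3 = 1.8
```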

  • Distant supervision for relation extraction beyond the sentence boundary (http://www.aclweb.org/anthology/E17-1110): This is an interesting paper that shows how to do relation extraction using distant supervision beyond the sentence boundary. This setting had not been tried before, and the curiosity is, of course, whether it is possible at all. The key idea is to adopt a document-level graph representation that augments conventional intra-sentential dependencies with new dependencies for adjacent sentences and discourse relations. What kind of arcs should we add to connect sentences? A simple but effective strategy is to add an edge between the dependency roots of adjacent sentences; they also investigate the impact of coreference and discourse parsing for this purpose. Note that as the graph is augmented with new arcs, the number of candidate entities grows; the authors propose a new candidate-selection strategy based on what they call minimal-span candidates. The number of possible paths between entities also grows, and the authors show that keeping the top N shortest paths restricts the candidates well. Finally, there are a bunch of useful features: dependency paths are an established and particularly effective source of relation extraction features, and lexical items, lemmas, and part-of-speech tags can be used alongside them. Results show that, compared to extraction within single sentences, cross-sentence extraction attains a similar accuracy with much higher recall. Adding paths beyond the shortest one led to a substantial improvement in accuracy; adding discourse relations, on the other hand, consistently led to a small drop in performance. While the technique in the paper is not that novel, I like the idea of going beyond the sentence boundary, and I like the outcome of doing so. A toy sketch of the document graph is below.
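
A toy sketch of the document-level graph (using networkx; the node names and edges are made up for illustration): link the dependency roots of adjacent sentences, then keep the top-N shortest paths between candidate entities.

```python
import itertools
import networkx as nx

G = nx.Graph()
G.add_edges_from([("root1", "Tad"), ("root1", "born")])        # sentence 1
G.add_edges_from([("root2", "He"), ("root2", "Springfield")])  # sentence 2
G.add_edge("root1", "root2")        # new arc between adjacent sentence roots

top_n = itertools.islice(nx.shortest_simple_paths(G, "Tad", "Springfield"), 3)
for path in top_n:
    print(path)                     # feature extraction would walk these paths
```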

  • Distant supervision for relation extraction without labeled data (https://web.stanford.edu/~jurafsky/mintz.pdf): This is a classic paper showing that we can learn relation extraction from Freebase (or any other knowledge base). The key assumption is that if two entities participate in a relation in the KB and appear together in a sentence, we can assume the sentence expresses that relation in some way. We can therefore extract features for the pair of entities, such as the specific words between and surrounding the two entities in the sentence, the part-of-speech tags of these words, a window of k words to the left of each entity with their part-of-speech tags, the dependency path between the entities, and so on. Results show that the distant supervision algorithm is able to extract high-precision patterns for a reasonably large number of relations. I like the idea, but I don't like the term "distant supervision", which I find quite confusing.

  • Feature Hashing for Large Scale Multitask Learning (http://alex.smola.org/papers/2009/Weinbergeretal09.pdf): This is a classic paper on feature hashing. Given millions of sparse features represented as words, it is non-trivial to convert them into numerical indices, since doing so naively blows up the dimensionality (another problem is dealing with unknown words). The paper shows that we should instead hash each word and use the hash value as its index. Collisions will certainly happen, but they show that for sparse features this is barely a problem. They also show that we should use two hash functions: one that hashes words to indices, and another that hashes words to a sign in {+1, -1}. The signs make colliding features cancel out in expectation (the hashed inner product becomes unbiased), though I don't fully understand the details. Finally, the number of hash bits is normally around 22-24, but my experience is that sometimes we need quite a few more. Overall, this is a classic technique to know and use in practice, and I highly recommend practicing it; see the sketch below. Note that (from Wikipedia, https://en.wikipedia.org/wiki/Feature_hashing) the hashing trick isn't limited to text classification and similar document-level tasks, but can be applied to any problem that involves large (perhaps unbounded) numbers of features.
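
A minimal sketch of the two-hash-function trick (md5 here is just a convenient stand-in for the paper's hash functions):

```python
import hashlib
import numpy as np

def hashed_features(tokens, n_bits=22):
    dim = 1 << n_bits
    x = np.zeros(dim)
    for tok in tokens:
        d = hashlib.md5(tok.encode()).digest()
        h = int.from_bytes(d[:8], "little") % dim   # bucket index
        sign = 1 if d[8] % 2 == 0 else -1           # xi(tok) in {+1, -1}
        x[h] += sign                                # collisions tend to cancel
    return x

x = hashed_features("the cat sat on the mat".split())
```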

  • A Review of Relational Machine Learning for Knowledge Graphs (https://arxiv.org/pdf/1503.00759.pdf): An exceptional review of relational machine learning methods for knowledge graphs. The paper presents two main lines of methods that aim to predict new "facts" in a graph: latent feature models (mainly bilinear models, multi-layer perceptrons and neural tensor networks) and graph feature models, plus methods that combine the two approaches. It also describes how to train the models in general, using penalized maximum likelihood with a log loss or squared loss (the squared loss is particularly efficient in combination with a closed-world assumption). It further describes how to generate negative examples, which matters because knowledge graphs often contain only positive training examples, and shows an interesting way of training when the "negative" data is not actually negative (a pairwise margin-based ranking loss). A tiny sketch of the bilinear model is below.
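
A tiny sketch of the bilinear latent feature model (RESCAL-style; the entity/relation names and random vectors are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
e = {name: rng.normal(size=d) for name in ["lincoln", "usa"]}  # entity vectors
W = {"born_in": rng.normal(size=(d, d))}                       # relation matrix

def score(s, r, o):
    return e[s] @ W[r] @ e[o]      # f(s, r, o) = e_s^T W_r e_o

print(score("lincoln", "born_in", "usa"))
```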

  • Poincaré Embeddings for Learning Hierarchical Representations (https://papers.nips.cc/paper/7213-poincare-embeddings-for-learning-hierarchical-representations.pdf): This paper builds better embedding models for learning hierarchical representations. The goal is to capture not only similarity but also hierarchy: similarity in the sense that connected nodes are placed close to each other and unconnected nodes far apart, and hierarchy in the sense that nodes lower in the hierarchy are placed farther from the origin, while nodes high in the hierarchy are close to the origin. This is possible with Euclidean geometry, but it is difficult and requires higher-dimensional embeddings. The key to the solution is hyperbolic space, which has been found useful for learning large networks. Training the model is about as straightforward as training word2vec; only the distance differs, as it is computed in hyperbolic space (see the sketch below). I don't really get all the details, but it is good to know such an interesting study.
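
The Poincaré-ball distance itself is short; a minimal numpy version (toy points):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    # d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    duv = np.dot(u - v, u - v)
    denom = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))
    return np.arccosh(1.0 + 2.0 * duv / (denom + eps))

root = np.array([0.01, 0.0])   # near the origin: high in the hierarchy
leaf = np.array([0.85, 0.30])  # near the boundary: low in the hierarchy
print(poincare_distance(root, leaf))
```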

  • Learning to Optimize (https://arxiv.org/abs/1606.01885): This paper explores automating algorithm design and presents a method to learn an optimization algorithm. It uses reinforcement learning to train the model, with a note that plain supervised learning does not fit the task since the data is not i.i.d. The authors use guided policy search instead of simple policy gradient because of the difficulty of training with policy gradients (high variance). Overall this is a very good paper to know, and a lot of it was new to me. A good companion reference is http://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/.

  • Deep Learning without Poor Local Minima (http://www.mit.edu/~kawaguch/publications/kawaguchi-nips16.pdf): This paper is a good theoretical reference showing that for deep linear neural networks, every local minimum is a global minimum; for deep non-linear neural networks, the same statement holds only under some unrealistic assumptions. The paper is far from applicable to practical non-linear networks, but it is nice progress towards a better understanding of training deep neural networks.

  • Identifying and attacking the saddle point problem in high-dimensional non-convex optimization (https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf): It is often thought that a main source of difficulty for gradient descent methods in finding the global minimum is the proliferation of local minima with much higher error than the global minimum. Recent results suggest this is unlikely to be the case, and this paper is one of the studies raising the issue. Specifically, the paper claims that the ratio of saddle points to local minima increases exponentially with the dimensionality N. Based on results from statistical physics, random matrix theory, neural network theory, and empirical evidence, the authors argue that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high-dimensional problems of practical interest. They also propose algorithms to escape saddle points, but admittedly those are too technical for me to understand.

  • The Loss Surfaces of Multilayer Networks (https://arxiv.org/pdf/1412.0233.pdf): The paper provides empirical/theoretical results that strike me: for large NNs, the critical points found are local minima of high quality as measured by test error, but this is unlikely to be the case for small NNs. Specifically, a major difference between large- and small-size networks is that for the latter, poor-quality local minima have nonzero probability of being recovered.

  • The Curse of Highly Variable Functions for Local Kernel Machines (http://nicolas.le-roux.name/publications/Bengio06_curse.pdf): The paper argues that non-parametric learning is not the right approach to high-dimensional problems. Specifically, non-parametric learning prefers solutions f such that similar inputs produce similar outputs. This does not work well in high-dimensional space: you cannot just average values in a neighborhood to get a meaningful value for an input. We can of course tune hyper-parameters controlling smoothness, but if we have a complex function with a million ups and downs, the paper shows we need at least half a million data points to learn it. This is problematic because in high-dimensional space the number of ups and downs is exponential. While I haven't really had time to understand the mathematics behind the paper, I think it is important: it covers a very broad class of algorithms, including kernel machines, unsupervised learning algorithms (Laplacian Eigenmaps, Spectral Clustering), and semi-supervised learning algorithms.

  • Neural Optimizer Search with Reinforcement Learning (https://arxiv.org/abs/1709.07417): A follow-up paper on optimizer search with reinforcement learning. Specifically, a controller in the form of a recurrent neural network is trained to generate an update equation for the optimizer, using a simple domain-specific language for update rules. The model is trained with reinforcement learning, with the reward signal being the accuracy on a held-out dataset after a fixed number of epochs. Experiments show promising results for neural optimizer search. Overall, this is a very nice paper to read. It is, however, costly to train the controller to generate update rules (on the order of 100 CPUs, albeit for less than a day).

  • Learning to Learn by Gradient Descent by Gradient Descent (https://arxiv.org/pdf/1606.04474.pdf): A neat paper to read. The movement towards automatically learned features has been widely successful, and the authors ask whether we should also learn the optimization algorithm itself instead of picking one by hand (SGD, Adam, momentum, ...). This is called meta-learning. The proposed method updates model parameters as \theta_{t+1} = \theta_t + g_t(\nabla_\theta f(\theta_t); \phi), where the learned optimizer g, with its own parameters \phi, is a recurrent neural network (specifically, an LSTM); see the sketch below. While the method is interesting and promising, it has the downside that the magnitudes of the values fed into the LSTM can vary wildly, and neural networks do not perform well when that happens. Also, it is very slow to train an LSTM network as an optimizer. Finally, it is hard for me to believe that this method converges faster overall (we need to train two different networks at the same time). A good reference on the work: https://theneuralperspective.com/2017/01/04/learning-to-learn-by-gradient-descent-by-gradient-descent/
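
A heavily simplified sketch of the update rule (my own toy dimensions; the real system shares the LSTM across coordinates and trains it by unrolling the optimizee):

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(1, 8)        # g: reads one gradient coordinate at a time
head = nn.Linear(8, 1)

def learned_step(theta, grad, state):
    h, c = cell(grad.view(-1, 1), state)         # one LSTM step per coordinate
    return theta + head(h).view_as(theta), (h, c)

theta = torch.randn(5)
state = (torch.zeros(5, 8), torch.zeros(5, 8))
grad = torch.randn(5)                            # stand-in for a real gradient
theta, state = learned_step(theta, grad, state)
```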

  • Neural Architecture Search with Reinforcement Learning (https://openreview.net/pdf?id=r1Ue8Hcxg): An intriguing paper which shows that we can design a good neural network automatically via neural architecture search. Specifically, the structure and connectivity of a network are specified by a configuration string, and an RNN controller is trained with reinforcement learning to generate such strings, rewarded by the accuracy of the resulting network on a validation set (a policy gradient is computed to update the controller). Over training iterations, the controller gives higher probabilities to architectures that receive high accuracies. The nice thing about neural architecture search is that it can learn good models from scratch and, unlike other methods, is not limited to searching a fixed-length space. The downside is that training such a controller is extremely expensive. Overall, this is a very nice/neat paper to read.

  • SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983): A simple and fun read. The authors propose a simple warm-restart technique for stochastic gradient descent: the learning rate is initialized to some value and scheduled to decrease within each restart cycle, then reset (see the sketch below). This is found to make SGD converge 2x-4x faster. While the paper is nice, I don't really have an intuition for why warm restarts help.
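
The schedule, as I understand it from the paper: cosine annealing within each cycle, then a jump back to the maximum learning rate.

```python
import math

def sgdr_lr(t_cur, t_i, lr_min=1e-5, lr_max=0.1):
    # t_cur: epochs since the last restart; t_i: length of this cycle.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))

for t in range(6):
    print(round(sgdr_lr(t % 5, 5), 4))   # decays, then restarts at t = 5
```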

  • Google Vizier: A service for black-box optimization (https://research.google.com/pubs/pub46180.html): The paper presents the design of a black-box optimization framework that produces state-of-the-art performance. The framework is an internal service that has become the de facto parameter-tuning engine at Google. I am, however, not sure whether we can use it for our own optimization problems. Still, it is an interesting paper to know.

  • Whodunnit? Crime Drama as a Case for Natural Language Understanding (http://homepages.inf.ed.ac.uk/scohen/tacl17csi.pdf): A fun read. The paper frames identifying the perpetrator in a crime series as a sequence labeling problem and shows how an LSTM behaves when reading the episode incrementally. Three things I learned from the work: an LSTM can work well if it has enough data, and it works much better than a CRF; the LSTM does not commit to a firm prediction, while humans do (i.e. once a human predicts a perpetrator, he/she sticks with it); and the LSTM cannot handle the case where there is no perpetrator at all (e.g. the death was a suicide).

  • Learning bilingual word embeddings with (almost) no bilingual data (http://www.aclweb.org/anthology/P17-1042): A good work showing how to learn bilingual word embeddings with only around 25 bilingual word pairs. It does so with a self-learning approach that builds on previous work, with a minor yet crucial modification: the learning process repeats over and over again (see the sketch below). The authors show very good results, with good insights into why such results are achieved. I don't have a background in the bilingual word embeddings task, but I think the paper is definitely interesting!
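
A compact sketch of the self-learning loop as I understand it (my own simplification; X and Y are row-normalized source/target embedding matrices, and seed_pairs are index pairs standing in for the ~25 seed dictionary entries):

```python
import numpy as np

def self_learning(X, Y, seed_pairs, n_iters=10):
    pairs = list(seed_pairs)                        # (src_idx, tgt_idx)
    for _ in range(n_iters):
        src, tgt = zip(*pairs)
        # (1) Orthogonal Procrustes fit of the mapping y ~ W x on current pairs.
        U, _, Vt = np.linalg.svd(Y[list(tgt)].T @ X[list(src)])
        W = U @ Vt
        # (2) Re-induce the dictionary from nearest neighbours under W; repeat.
        sims = (X @ W.T) @ Y.T
        pairs = [(i, int(sims[i].argmax())) for i in range(len(X))]
    return W, pairs
```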

  • Towards Decoding as Continuous Optimisation in Neural Machine Translation (http://www.aclweb.org/anthology/D17-1014): Decoding in NMT is hard for two reasons: 1. there is a limit to incorporating additional global features or constraints, and 2. decoding left-to-right cannot exploit right-to-left context. The paper addresses the challenge by relaxing this discrete optimisation problem into a continuous one: we drop the integrality (i.e. one-hot vector) constraint on the prediction variables and allow them soft assignments within the probability simplex. The idea is bold/cool, and the authors are the first to implement such a thing. A lot of work is needed to make the model work, including model initialization, learning rates, etc. While the work is definitely good and the proposed decoding framework is novel, it is unclear to me whether the relaxation is really the right way to solve the above problems.

  • Backprop is not just the chain rule (http://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/): This article shows a connection between backpropagation and the method of Lagrange multipliers. While I am not really a big fan of philosophical questions, I like this interesting connection, and I think it can be useful for training neural networks with certain kinds of constraints.

  • A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models (https://arxiv.org/pdf/1708.00111.pdf): A good paper on an interesting problem in seq2seq. Typical cross-entropy training does not directly consider the behaviour of the decoding method, so beam decoding can end up worse than greedy decoding (i.e. beam size 1). The work hypothesizes that the under-performance of beam search in certain scenarios can be resolved by a better-designed training objective that directly integrates beam search information (their Equation 1). While this is ideal, the new loss function is discontinuous; the solution relies on a Gumbel approximation. The paper shows nice results with the model. It is hard to say whether the technique will be widely adopted (I have doubts, mainly about whether Hamming distance is always the relevant cost), but the paper is definitely an enjoyable read.

  • Non-Autoregressive Neural Machine Translation (https://einstein.ai/static/images/pages/research/non-autoregressive-neural-mt.pdf and https://einstein.ai/research/non-autoregressive-neural-machine-translation): I don't think I fully got all the details in the paper, and I need to read it again later. The core idea is building an NMT model that is not autoregressive. It does so with an additional component, a fertility predictor, which tells us how many target words each input word should translate into; given this information, we can decode all positions at the same time. Sophisticated techniques are required to make it work, though. The result, however, is not that good (it is a bit over-hyped indeed): the model obtains an impressive speed-up, but at the cost of a significant drop in translation accuracy.

  • The Variational Gaussian Process (https://arxiv.org/abs/1511.06499) - I don't think I fully understand the idea of using GPs in variational inference, but the paper really gave me a better picture of variational inference with hierarchical latent variables: a variational distribution that is itself a hierarchical Bayesian model is very difficult to train, and the paper provides a cutting-edge method to do so. I recommend readers take a look at these slides (http://mlg.postech.ac.kr/~readinglist/slides/20160822.pdf); I personally learned a lot from them.

  • Natural Gradient (http://andymiller.github.io/2016/10/02/natural_gradient_bbvi.html): I had heard about the natural gradient concept for a while, but I never really grasped it until this tutorial. It is a very beautiful idea, and I encourage people to take a look to understand the concept better! A one-line sketch of the update is below.
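
The update in a nutshell: precondition the gradient by the inverse Fisher information, so the step is taken in distribution space rather than raw parameter space (toy Fisher matrix below).

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1):
    return theta - lr * np.linalg.solve(fisher, grad)   # theta - lr * F^{-1} g

theta = np.array([1.0, -0.5])
grad = np.array([0.4, 0.2])
F = np.array([[2.0, 0.3], [0.3, 1.0]])                  # toy Fisher matrix
print(natural_gradient_step(theta, grad, F))
```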

  • Progressive growing of GANs for improved quality, stability and variation (http://research.nvidia.com/sites/default/files/pubs/2017-10_Progressive-Growing-of/karras2017gan-paper.pdf): The paper shows how to improve GANs with a better training scheme. Recall the main problem with GANs: they are very difficult to train, since the gradients tend to be useless if the training and generated distributions do not have substantial overlap (i.e. they are too easy to tell apart). Among the various tweaks addressing this problem, the most important is to gradually grow the size of both the generator and discriminator networks: starting from a low resolution, the authors add new layers that model increasingly fine-grained details as training progresses. I am not really familiar with GANs in practice, but the images the improved model produces are really impressive, perhaps the best to date.

  • Multi-Task Bayesian Optimization (https://papers.nips.cc/paper/5086-multi-task-bayesian-optimization.pdf) - Another classic paper. The authors investigate how to transfer the knowledge gained from previous optimizations to a new and related task, which turns out to be pretty straightforward under multi-task GPs. They also investigate how to perform Bayesian optimization using knowledge gained from related yet easier-to-optimize tasks. This idea makes sense to me, and it is not surprising to see how useful it is in their topic modeling case study. I was, however, not so clear about the part on optimizing an average function over multiple tasks.

  • Practical Bayesian Optimization of Machine Learning Algorithms (https://arxiv.org/pdf/1206.2944.pdf) - A very, very good paper. I was aware of this work but hadn't managed to look at it (it is not easy to follow without sufficient background). Basically, given a set of hyperparameters, the paper tries to search in a smarter way: instead of trying all possible values as in exhaustive grid search, it deploys a proxy optimization (a GP surrogate with an acquisition function, sketched below) that guides us to promising points. The results are very promising: the model finds good hyperparameters much more quickly than exhaustive grid search. I strongly recommend taking a look at this paper for a taste of how GPs can solve the really difficult problem of hyperparameter optimization.
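
A sketch of the expected improvement acquisition the paper builds on (minimization convention; mu/sigma would come from the GP posterior at candidate points, here they are toy numbers):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y):
    sigma = np.maximum(sigma, 1e-12)
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.30, 0.25, 0.40])       # GP posterior means at candidates
sigma = np.array([0.05, 0.20, 0.01])    # GP posterior std deviations
print(expected_improvement(mu, sigma, best_y=0.28).argmax())  # picks index 1
```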

  • Deep Neural Networks as Gaussian Processes (https://arxiv.org/abs/1711.00165) - It has long been known that a GP corresponds to a neural network with a single (infinitely wide) hidden layer, but it was not clear how this extends to deep networks. This work delineates the correspondence between deep neural networks and Gaussian processes. Specifically, it provides the covariance function corresponding to each non-linearity used in the network. For ReLU the equivalent covariance is easy to compute, but for certain other non-linearities it is not trivial, and they provide an algorithm for computing the covariance matrix that is quite difficult to follow. Bayesian inference also lets the model know where its predictions are good and where they are not; Figure 2 is really impressive/interesting to me. Overall I think the paper is very interesting/important.

  • What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? The paper studies the benefits of modeling uncertainty in Bayesian deep learning models for vision tasks. It distinguishes two types of uncertainty: aleatoric uncertainty, which captures noise inherent in the observations, and epistemic uncertainty, which accounts for uncertainty in the model. The authors propose a BNN model that captures both and show how this helps vision tasks. Some of the techniques are too involved for me, but overall I enjoyed reading the paper.

  • Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (https://arxiv.org/abs/1506.02142): Take-home message: this paper is quite thought-provoking. It reveals that dropout training can be seen as performing approximate Bayesian inference in a deep Gaussian process model. It also suggests a way to get model uncertainty (MC dropout, sketched below), which is important in practice: a predictive mean plus predictive uncertainty should give more stable behaviour, especially when the model is run in the wild (i.e. the test data is completely different from the training data). The mathematics is really involved, though.
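
A minimal MC dropout sketch (toy model; the key point is keeping dropout stochastic at test time and averaging T passes):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(64, 1))

def mc_dropout_predict(model, x, T=50):
    model.train()                        # keeps Dropout active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(T)])
    return preds.mean(0), preds.std(0)   # predictive mean and uncertainty

mean, std = mc_dropout_predict(model, torch.randn(4, 10))
```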

  • Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent (http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2011_0485.pdf): Take-home message: the paper ignited the trend of using asynchronous SGD instead of synchronous SGD. Assuming the updates touch very sparse parameters, we can run asynchronous SGD without any locking mechanism for synchronizing model parameters. The mathematical proofs behind this result are difficult to understand, though. As a side note, the method does not fit training NNs, because NN parameter updates are not that sparse.

  • Deep Kernel Learning (https://arxiv.org/abs/1511.02222) and Stochastic Variational Deep Kernel Learning (http://papers.nips.cc/paper/6425-stochastic-variational-deep-kernel-learning) and Learning Scalable Deep Kernels with Recurrent Structure (https://arxiv.org/abs/1610.08936) - Take-home message: these studies contribute a hybrid architecture between GPs and (deep) neural networks (see the sketch below). The combination makes sense, and experiments show promising results. Training is end-to-end and scalable (the scalability is mainly due to earlier work by Andrew Wilson et al., though, not these studies per se). I find this research line very inspiring, yet the papers are really technical to follow.
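
The core construction is compact: a base kernel applied to learned features g(x; w) instead of the raw inputs. A numpy sketch with a stand-in "network":

```python
import numpy as np

def deep_rbf_kernel(X1, X2, net, lengthscale=1.0):
    Z1, Z2 = net(X1), net(X2)            # k(x, x') = k_RBF(g(x; w), g(x'; w))
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
net = lambda X: np.tanh(X @ W)           # stand-in for a trained deep network
K = deep_rbf_kernel(rng.normal(size=(4, 5)), rng.normal(size=(6, 5)), net)
```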

  • Assessing Approximations for Gaussian Process Classification (http://papers.nips.cc/paper/2903-assessing-approximations-for-gaussian-process-classification.pdf) and its longer version Assessing Approximate Inference for Binary Gaussian Process Classification (http://www.jmlr.org/papers/volume6/kuss05a/kuss05a.pdf) - Take-home message: GP classification models are intractable to train. There are three main ways to ease the intractability: Laplace's method, expectation propagation, and MCMC. MCMC works best but is too expensive. Laplace's method is simple, but the papers suggest it is very inaccurate. EP works surprisingly well.

  • Sequential Inference for Deep Gaussian Process (http://www2.ift.ulaval.ca/~chaib/publications/Yali-AISTAS16.pdf) and Training and Inference for Deep Gaussian Processes (undergrad thesis - http://keyonvafa.com/deep-gaussian-processes/) - Take-home message: deep GPs are powerful models, yet difficult to train and do inference in due to computational intractability. These works address the problem with sampling mechanisms, and the techniques are very straightforward. The first paper eases the computational cost by using an active set instead of the full dataset, though the size of the active set has a significant impact on performance. As a side note, performance really depends on parameter initialization (second paper). The first paper really shows the benefits of deep GP models, even though a deep GP does not work that well on MNIST classification (the accuracy is quite low, around 94%-95%). The first paper is really good and deserves more attention.

  • Efficient softmax approximation for GPUs - https://arxiv.org/abs/1609.04309 - Take-home message: provides a systematic comparison between various methods for speeding up the training of neural language models with large vocabularies, and proposes one that fits GPUs well. Their method is very technical to follow, but it works best. Their modification of Differentiated Softmax also works pretty well, though it is totally unclear to me how they modify D-Softmax.

  • Strategies for Training Large Vocabulary Neural Language Models - http://www.aclweb.org/anthology/P16-1186 - Take-home message: provides a systematic comparison between various methods for speeding up the training of neural language models with large vocabularies. Hierarchical softmax works best for large datasets (very surprising), differentiated softmax works well for small-scale datasets (but the speed-up factor is not so high), NCE works very badly, and self-normalization works OK. Good notes on the paper can also be found at https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/strategies-for-training-large-vocab-lm.md

  • See, hear, and read: deep aligned representations - https://arxiv.org/abs/1706.00932: The paper proposes a nice cross-modal network to approach the challenge of learning discriminative representations shared across modalities. Given inputs of different types (image, sound, text), the model produces a common representation shared across modalities. The common representation can bring huge benefits: for instance, assume we have pairs of images and sound (from videos) and pairs of images and text (from captioning datasets); such a common representation can map between sound and text using images as a bridge (pretty cool!). It is, however, unclear from the paper how the cross-modal networks are designed/implemented; lots of technical details are missing, and it is very hard to walk through the paper.

  • Bagging by Design (on the Suboptimality of Bagging) - https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8406 - Take-home message: a nice study proposing a provably optimal subsampling design for bagging. The proposed method outperforms the original bagging method convincingly, both theoretically and experimentally.

  • On Multiplicative Integration with Recurrent Neural Networks - https://arxiv.org/abs/1606.06630 - Take-home message: we can replace the additive integration in an RNN with multiplicative integration (see the sketch below). The goal is to tie the state transition (i.e. the gradient of state over state) more tightly to the inputs.
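
The change in one line (my own toy step function):

```python
import numpy as np

def mi_rnn_step(x, h, Wx, Wh, b):
    # Vanilla RNN:  tanh(Wx @ x + Wh @ h + b)
    # Multiplicative integration swaps the sum for a Hadamard product,
    # so the state transition is gated directly by the input:
    return np.tanh((Wx @ x) * (Wh @ h) + b)

x, h = np.ones(3), np.ones(4)
print(mi_rnn_step(x, h, np.ones((4, 3)), np.ones((4, 4)), 0.0))
```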

  • Importance weighted autoencoders - https://arxiv.org/abs/1509.00519 - Take-home message: a nice paper, showing that training with importance-weighted samples is always at least as good (there is a clear explanation in the paper). Also, one can tighten the bound simply by drawing more samples in the Monte Carlo objective; see the sketch below.
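
The k-sample bound in a few lines (log_w would be real log importance weights from an encoder/decoder pair; here the function is generic):

```python
import numpy as np
from scipy.special import logsumexp

def iwae_bound(log_w):
    # log_w: (k,) values of log [p(x, z_i) / q(z_i | x)], z_i ~ q.
    # L_k = E[log (1/k) sum_i w_i]; larger k gives a tighter bound.
    return logsumexp(log_w) - np.log(len(log_w))

print(iwae_bound(np.log([0.2, 0.5, 0.1])))
```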

  • Adversarial Autoencoders - https://arxiv.org/abs/1511.05644 - Take-home message: instead of using the KL divergence as in variational autoencoders, we should rather optimize the JS divergence (via an adversarial discriminator). This makes sense, as JS can be better than KL for inference.

  • On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima - https://arxiv.org/abs/1609.04836 - Take-home message: a very good paper explaining why SGD with small batch sizes is so useful: an optimizer should aim for flat minima rather than sharp minima, and small batch sizes help achieve this because training has lots of gradient noise.

  • Deep Exponential Families - https://arxiv.org/abs/1411.2581 - Take-home message: stacking multiple exponential-family models (up to 3 layers) can improve performance. Inference is much harder, though. I personally like this work a lot!

  • Semi-Supervised Learning with Deep Generative Models - https://arxiv.org/abs/1406.5298 - Take-home message: a classic on semi-supervised learning with deep generative models, trained using stochastic variational inference. The model may not work as well as ladder networks, yet it is a classic and has broad applications.

  • Exponential Family Embeddings - https://arxiv.org/abs/1608.00778 - Take-home message: a very cool work showing how to generalize word2vec to other interesting settings (e.g. items in a shopping basket). Also, instead of the softmax (exp) of the original model, the paper allows other exponential-family choices, including Poisson, Gaussian, and Bernoulli. I personally like this work a lot!

  • Hierarchical Variational Models - https://arxiv.org/abs/1511.02386 - Take-home message: shows how to increase the richness of the variational distribution q by placing a hierarchical model with a global parameter over it. The model itself is equivalent to Auxiliary Deep Generative Models (https://arxiv.org/abs/1602.05473).

  • The Marginal Value of Adaptive Gradient Methods in Machine Learning - https://arxiv.org/abs/1705.08292 - Take-home message: Adagrad/Adam and other adaptive methods are awesome, but if we tune SGD properly, we can do much better.

  • Markov Chain Monte Carlo and Variational Inference: Bridging the Gap - https://arxiv.org/abs/1410.6460 - Take-home message: the paper proposes a very nice idea for improving MCMC using variational inference (exploiting the variational lower bound to see whether the chain converges and to tune MCMC parameters). Meanwhile, it can also help variational inference using MCMC, but how? That is the point I don't quite get.

  • Categorical Variational Autoencoders using Gumbel-Softmax - https://arxiv.org/abs/1611.01144 - Take-home message: how to convert a discrete variable into an approximate continuous form that fits the reparameterization trick, using the Gumbel-softmax function (sketched below).
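
A minimal sketch of the sampling step (toy logits):

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=0.5, rng=np.random.default_rng()):
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max()                       # for numerical stability
    e = np.exp(y)
    return e / e.sum()                 # soft one-hot; tau -> 0 gives hard samples

print(gumbel_softmax_sample(np.log([0.1, 0.6, 0.3])))
```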

  • Context Gates for Neural Machine Translation - https://arxiv.org/abs/1608.06043 - Take-home message: the paper shows that in seq2seq we should control how a word is generated: a content word should be generated based on the source inputs, while a common word should be generated based on the target-side context. The paper proposes a gate network that integrates this information into seq2seq in a nice way.

  • https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html: some models trained on conversation data for sentence embeddings, as well as some models used for question answering. Pretty basic.

  • Rationalizing Neural Predictions (https://aclweb.org/anthology/D16-1011): a very nice paper that learns the rationale behind a prediction given the input.