playdl

Playing with deep learning

Exploring deep learning neural networks

This repository is just a collection of notebooks that I'm using to explore various deep learning (DL) concepts and APIs (such as Keras). The mathematics behind the networks is straightforward but tedious, so I'm focusing on applying networks.

AWS stuff

Deep learning image classification

Previously I tried DL for regression on some tabular data, but it struggled to keep up with a simple random forest. DL shines with unstructured data, so I turned to the standard hello world of MNIST classification:
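
For reference, here is a minimal sketch of that hello world in Keras; the layer sizes and epoch count are illustrative, not necessarily what the notebook uses:

```python
# Minimal MNIST classifier sketch (illustrative sizes, not the notebook's exact code).
from tensorflow import keras

(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 28 * 28) / 255.0   # flatten images, scale pixels to 0..1
X_test = X_test.reshape(-1, 28 * 28) / 255.0

model = keras.Sequential([
    keras.Input(shape=(28 * 28,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # one probability per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))
```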

Text processing

The next step is to play around with tokenizing text and representing words and documents with sparse word vectors:
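
As a rough illustration of what "sparse word vectors" means here, a bag-of-words document-term matrix can be built in a couple of lines (toy sentences, and scikit-learn stands in for whatever tokenizer the notebooks actually use):

```python
# Tokenize a few toy documents and build sparse bag-of-words vectors.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.get_feature_names_out())    # learned vocabulary
print(X.toarray())                           # each row is a sparse document vector
```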

Categorical variable embedding

Before jumping into dense word vectors, I wanted to explore the whole embedding mechanism using some simple categorical variables, building a few examples manually before turning to Keras's built-in embedding layer.
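
The core idea is that an embedding is just a trainable matrix indexed by category code, which is the same thing as multiplying a one-hot vector by that matrix. A tiny hand-rolled sketch with toy sizes (plain numpy, not the notebook's code):

```python
# Manual categorical embedding: one-hot times matrix == row lookup.
import numpy as np

n_categories, embed_dim = 5, 3
E = np.random.randn(n_categories, embed_dim)   # one embedding vector per category

cat = 2                                        # some category code
onehot = np.zeros(n_categories)
onehot[cat] = 1.0

print(onehot @ E)   # one-hot vector times embedding matrix ...
print(E[cat])       # ... equals a simple row lookup
```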

PyTorch

PyTorch seems a bit more low-level, but it's very cool and easy to use. It is essentially numpy plus automatic differentiation, but it also includes a torch.nn layer that looks very much like Keras. First, I went through some exercises following their tutorial to get used to the library and its automatic differentiation:
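
The "numpy plus autograd" flavor boils down to something like this trivial example (mine, not the tutorial's):

```python
# Build a scalar from tensor operations and let .backward() fill in gradients.
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

y = (w * x).sum()   # a scalar function of x and w
y.backward()        # autograd computes dy/dx and dy/dw

print(x.grad)       # equals w
print(w.grad)       # equals x
```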

Now I've had some very good luck building my own collaborative filtering mechanism using my own gradient descent with momentum in raw PyTorch. I am more or less able to get the same predictive ability using matrix factorization that I did with the deep learning version above. I implemented gradient descent with momentum and then full AdaGrad, without any minibatching.
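
The shape of that raw-PyTorch matrix factorization looks roughly like the sketch below; the sizes, learning rate, and toy ratings are made up for illustration:

```python
# Matrix factorization with hand-rolled gradient descent + momentum (no minibatching).
import torch

n_users, n_movies, n_factors = 100, 200, 10
users   = torch.randint(0, n_users, (1000,))    # toy (user, movie, rating) triples
movies  = torch.randint(0, n_movies, (1000,))
ratings = torch.rand(1000) * 5

U = torch.randn(n_users, n_factors, requires_grad=True)
V = torch.randn(n_movies, n_factors, requires_grad=True)

lr, momentum = 0.05, 0.9
vU, vV = torch.zeros_like(U), torch.zeros_like(V)

for epoch in range(100):
    pred = (U[users] * V[movies]).sum(dim=1)    # dot product per (user, movie) pair
    loss = ((pred - ratings) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        vU = momentum * vU - lr * U.grad        # momentum accumulators
        vV = momentum * vV - lr * V.grad
        U += vU
        V += vV
        U.grad.zero_()
        V.grad.zero_()
```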

I'm going to start again on collaborative filtering with PyTorch: Regularized collaborative filtering with PyTorch. Lessons:

  • I was only tracking the training error, which I could get pretty low, but then I was wondering why the validation error using a random forest OOB (user+movie->rating) often was not good. Duh. I was watching the training error inside the training loop, not the validation error.
  • Ollie just said: "...it depends on what you are looking for. Embeddings as first step or embeddings or a means of reducing dimensions. First one is validation, second is training error and even not using any validation or test at all."
  • By randomly selecting validation and training sets at each iteration (epoch), we are moving to stochastic gradient descent (SGD). Previously I only used gradient descent.
  • I could not get the bias version working well. Yannet pointed out that I need to initialize my randomized weights so that the predicted rating starts out with an average around 2.5, i.e., somewhere in the middle of the zero-to-five range. Initialization does after all seem to make a difference, in both training and validation error.
  • Adding bias dramatically slowed the training process down! Way more than I would expect, given the small increase in data and operations.
  • I also noticed that Jeremy/Sylvain do L2 regularization in their book, which clued me in to the fact that validation error is what matters. We don't want those coefficients getting too big, as they are likely chasing something in the training set.
  • I also found that simple gradient descent with momentum did a pretty good job and I didn't need the complexity of AdaGrad.
  • Another trick from the fastai book is to keep the predicted rating in the range 0..5, which constrains where the loss function can send gradient descent. (A sketch combining several of these lessons follows this list.)
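
Here's a sketch that pulls several of those lessons together: user/movie embeddings plus per-user and per-movie biases, small-stddev initialization so initial predictions sit near the middle of the rating range, a sigmoid to squash predictions into 0..5, and L2 regularization via weight_decay in a built-in optimizer. Sizes and hyperparameters are hypothetical, not the notebook's:

```python
# Regularized, biased matrix factorization with a built-in PyTorch optimizer.
import torch
import torch.nn as nn

class BiasedMF(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=20):
        super().__init__()
        self.U = nn.Embedding(n_users, n_factors)
        self.V = nn.Embedding(n_movies, n_factors)
        self.bu = nn.Embedding(n_users, 1)      # per-user bias
        self.bv = nn.Embedding(n_movies, 1)     # per-movie bias
        # small init => scores near 0 => initial predictions near 2.5
        for p in self.parameters():
            nn.init.normal_(p, std=0.01)

    def forward(self, users, movies):
        dot = (self.U(users) * self.V(movies)).sum(dim=1)
        score = dot + self.bu(users).squeeze(1) + self.bv(movies).squeeze(1)
        return torch.sigmoid(score) * 5.0       # constrain predictions to 0..5

model = BiasedMF(n_users=100, n_movies=200)
# weight_decay is the L2 penalty that keeps coefficients from chasing the training set
opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)
```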

Dang. I still don't think I'm getting a good result, and it is likely my simple gradient descent mechanism. I'm going to try this over again using the built-in PyTorch optimizers. Ha. That was most likely an incorrect assessment combined with display issues from indexing. All is well now:

Text processing w/o RNNs

I'm going to try working through the Keras book examples using PyTorch:

Oooook. It turns out I was using just 1,000 training and 1,000 testing records, but Keras was using a much bigger data set with 25,000 records each. With the full data, my PyTorch methods get roughly the same 75% accuracy as the Keras book. I redid everything in this notebook:

Recurrent neural networks

Generating text

OK, starting again with a simple RNN using the matrix algebra from my article.
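
The recurrence in plain matrix algebra is just h_t = tanh(W h_{t-1} + U x_t + b), with a linear layer V on top to score the next character. A minimal sketch with made-up sizes:

```python
# One step of a simple character RNN written as explicit matrix algebra.
import torch

vocab_size, hidden_size = 80, 100
W = torch.randn(hidden_size, hidden_size) * 0.01   # hidden-to-hidden
U = torch.randn(hidden_size, vocab_size) * 0.01    # input-to-hidden
V = torch.randn(vocab_size, hidden_size) * 0.01    # hidden-to-output
b = torch.zeros(hidden_size)

def step(h, x_onehot):
    """Fold one character into the running hidden state; return new state and logits."""
    h = torch.tanh(W @ h + U @ x_onehot + b)
    logits = V @ h                 # unnormalized scores for the next character
    return h, logits

h = torch.zeros(hidden_size)
x = torch.zeros(vocab_size); x[5] = 1.0   # one-hot for some character id
h, logits = step(h, x)
```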

Transducers

Attention

Lessons

Some notes beyond what is in the notebooks

  • The role of a neuron or layer is fluid. The output of a layer could be the probability of class k, but if we add a further layer, that output becomes a feature that helps the new layer make its prediction.
  • ReLU for the RNN also works on the Obama speeches, not just tanh. Use 0 bias and the identity matrix for W, and clip gradient values to 1.0. Must use F.cross_entropy(), which combines log-softmax with the negative log-likelihood loss; it has much better numerical characteristics, otherwise I get NaN immediately. Clipping isn't required, it seems, but it improves convergence, and ReLU seems to tolerate a higher learning rate. I can't seem to get the ReLU version to match the final accuracy score of tanh. Actually, at epoch 20, tanh gets 55% and ReLU gets 63% (same 0.001 learning rate, ReLU gradient values clipped at 10). Hmm... I have logs of some tanh runs, though, that are 64% at epoch 20. Damn hyperparameters. (A sketch of this setup follows below.)
  • Training is super sensitive to weight initialization. With sd=0.001, training didn't move at all; with 1.0, it is sometimes too far away from the final weights. I couldn't figure out why my stacked RNN wasn't training; sd=0.001 seems too low, although if you run multiple times with sd=0.001 it sometimes works. Frustrating to train these beasts. Even more complicated: it seems to eventually start to converge/train after maybe 10 epochs of trivial or no movement.
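
Here is what that ReLU-RNN setup looks like in outline (dimensions and the clip value are illustrative, not the notebook's):

```python
# ReLU RNN: identity init for W, zero bias, F.cross_entropy on logits, gradient value clipping.
import torch
import torch.nn.functional as F

vocab_size, hidden_size = 80, 100
W = torch.eye(hidden_size, requires_grad=True)                      # identity init
U = (torch.randn(hidden_size, vocab_size) * 0.01).requires_grad_()
V = (torch.randn(vocab_size, hidden_size) * 0.01).requires_grad_()
b = torch.zeros(hidden_size, requires_grad=True)                    # zero bias

def forward(chars):
    """chars: 1-D LongTensor of character ids; returns next-char logits for each step."""
    h = torch.zeros(hidden_size)
    logits = []
    for c in chars:
        x = F.one_hot(c, vocab_size).float()
        h = torch.relu(W @ h + U @ x + b)
        logits.append(V @ h)
    return torch.stack(logits)

chars = torch.randint(0, vocab_size, (20,))
logits = forward(chars[:-1])
loss = F.cross_entropy(logits, chars[1:])   # fused log-softmax + NLL: avoids the NaNs
loss.backward()
torch.nn.utils.clip_grad_value_([W, U, V, b], 1.0)   # clip gradient values, not norms
```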

  • A general principle for autograd with PyTorch: you can't reuse partial results from one optimization step to the next, despite that being OK if we were doing symbolic derivatives. It's just a limitation of how dynamic auto gradients are computed. E.g., I wanted to use a single output vector (the last h) from the encoder in the multiple steps of the decoder but got the dreaded "Trying to backward through the graph a second time, but the buffers have already been freed." Each decoder step is a function of that previous single encoder h, and PyTorch can only compute the derivative through a series of operations once. I tried loss.backward(retain_graph=True) without success too. (A tiny demo of this appears after this list.)
  • Another weird detail: h in the RNN becomes an H matrix for batches, one h vector per record of the batch. Now, if we add a bias term, do we also need a bias per record, i.e., a B matrix rather than a vector? When using the (already trained) decoder to generate a result, we are processing one record, so we need one bias vector. Which one? Hahah. I had just been doing B[:,0], picking one, i.e., we'd trained batch_size bias vectors but used only one during generation. Which bias do we use?! Ah, just use a single bias vector and broadcasting does the rest. So it's b, not B. (See the broadcasting sketch after this list.)
  • Weight matrix initialization matters a huge amount. I was trying to concatenate a context vector with an input vector for a decoder, but it was not training well at all. When I used a separate matrix to bring the context into the decoder, I initialized that matrix to the identity so that it brought the context in unaltered. However, when concatenating, I was using a random matrix for the combined input and context vectors, which got noticeably worse performance. Presumably it would train just as well in the end, but it is much slower at the very least; perhaps it wouldn't even find the right spot with gradient descent. Wow, even the stddev matters for randn matrices: if it's too tight, training doesn't move in some cases. It might also be the reverse!
  • Dropout is like bootstrapping for RFs, or maybe like selecting just a subset of m variables during splitting. The goal is to create multiple independent estimators.
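
A tiny demo of the "backward through the graph a second time" point above: if the encoder output is computed once outside the training loop and reused every step, the second backward() fails because the graph was freed; rebuilding it each step (or detaching it when you don't need its gradient) avoids the error. Toy tensors, not the actual encoder/decoder code:

```python
# Reusing a tensor from a previous step's graph breaks backward(); recompute it instead.
import torch

W = torch.randn(3, 3, requires_grad=True)
x = torch.randn(3)

# h = W @ x   # computing h once up here and reusing it in the loop raises:
              # "Trying to backward through the graph a second time ..."

for step in range(2):
    h = W @ x                      # rebuild the graph fresh each optimization step
    loss = (h ** 2).sum()
    loss.backward()
    with torch.no_grad():
        W -= 0.01 * W.grad
        W.grad.zero_()
```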
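
And the bias-broadcasting point in a few lines: keep the bias as a single column vector and broadcasting replicates it across every record of a batch, so the same b works for a full training batch and for a single record at generation time. Shapes are made up for illustration:

```python
# One bias vector b broadcasts across the batch; no per-record B matrix needed.
import torch

hidden_size, input_size, batch_size = 4, 5, 3
W = torch.randn(hidden_size, input_size)
b = torch.randn(hidden_size, 1)                  # a single bias column

X_batch = torch.randn(input_size, batch_size)    # one column per record
H = W @ X_batch + b                              # b broadcasts across the columns

x_one = torch.randn(input_size, 1)               # single record at generation time
h = W @ x_one + b                                # exact same b, no B[:, 0] picking
print(H.shape, h.shape)                          # torch.Size([4, 3]) torch.Size([4, 1])
```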