This repository contains the code used to create the examples for the talk "Maskinlæring skriver din neste presentasjon!" ("Machine learning writes your next presentation!"), held in Norwegian at JavaZone 2016.
We used Mallet for topic modeling on the summary and description of every presentation. Good starting points for learning more about topic modeling are Mallet Topic Modeling and The Programming Historian's lesson on Topic Modeling and Mallet. We used the example code at Topic Modeling for Java Developers as our starting point.
We used [Deeplearning4j](http://deeplearning4j.org) and its word2vec package to generate our vector space models. We used the CBOW algorithm rather than Skip-gram because our dataset is small. We also used [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) to preprocess the text documents. In particular, POS tagging and lemmatization are useful preprocessing techniques that come out of the box with Stanford CoreNLP. Useful links to learn more about vector space models:
- [Introduction to word2vec](http://deeplearning4j.org/word2vec)
- [Stemming and lemmatizing](https://en.wikipedia.org/wiki/Stemming)
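
To make the difference between CBOW and Skip-gram concrete, here is a minimal sketch of the training examples each one derives from a token sequence. CBOW predicts a word from its surrounding context words, while Skip-gram predicts each context word from the center word. The class `TrainingPairs`, its methods, and the sample tokens are hypothetical and written for illustration only; this is not how Deeplearning4j implements word2vec internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Deeplearning4j.
class TrainingPairs {

    /** Words within `window` positions of index `center`, excluding the center word itself. */
    static List<String> context(List<String> tokens, int center, int window) {
        List<String> ctx = new ArrayList<>();
        int from = Math.max(0, center - window);
        int to = Math.min(tokens.size() - 1, center + window);
        for (int i = from; i <= to; i++) {
            if (i != center) {
                ctx.add(tokens.get(i));
            }
        }
        return ctx;
    }

    /** CBOW: one example per position, written as "context words -> target word". */
    static List<String> cbowExamples(List<String> tokens, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            examples.add(context(tokens, i, window) + " -> " + tokens.get(i));
        }
        return examples;
    }

    /** Skip-gram: one example per (center word, context word) pair. */
    static List<String> skipGramExamples(List<String> tokens, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            for (String c : context(tokens, i, window)) {
                examples.add(tokens.get(i) + " -> " + c);
            }
        }
        return examples;
    }

    public static void main(String[] args) {
        // Assumed input: tokens that have already been lemmatized.
        List<String> tokens = Arrays.asList("machine", "learning", "write", "presentation");
        cbowExamples(tokens, 1).forEach(System.out::println);
        skipGramExamples(tokens, 1).forEach(System.out::println);
    }
}
```

With a window of 1, the four tokens above give four CBOW examples but six Skip-gram examples, because Skip-gram emits one example per context word.
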
To generate abstracts based on the abstracts from previous years, we use GravesLSTM from Deeplearning4j. We have two working versions: one using Spark and one using a custom-made CharacterIterator (as in this example from Deeplearning4j). We seem to get the best results with the Spark version, even with the same parameters. Useful links to learn more about RNNs:
- A Beginner’s Guide to Recurrent Networks and LSTMs
- Recurrent Neural Networks in DL4J
- The Unreasonable Effectiveness of Recurrent Neural Networks
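
As a rough sketch of what a character-level iterator has to produce for an LSTM, the code below builds a character vocabulary, one-hot encodes a character, and cuts text into input/target pairs where the target is the input shifted by one character. The class `CharSequences` and its methods are hypothetical, written for illustration; this is not the CharacterIterator from the Deeplearning4j example or from this repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class, not Deeplearning4j's CharacterIterator.
class CharSequences {

    /** Assigns each distinct character an index in order of first appearance. */
    static Map<Character, Integer> buildVocabulary(String text) {
        Map<Character, Integer> vocab = new LinkedHashMap<>();
        for (char c : text.toCharArray()) {
            vocab.putIfAbsent(c, vocab.size());
        }
        return vocab;
    }

    /** One-hot vector: 1.0 at the character's index, 0.0 elsewhere. The character must be in the vocabulary. */
    static double[] oneHot(char c, Map<Character, Integer> vocab) {
        double[] vector = new double[vocab.size()];
        vector[vocab.get(c)] = 1.0;
        return vector;
    }

    /**
     * Cuts the text into consecutive windows and returns {input, target} pairs.
     * Each target is its input shifted one character to the right, so the
     * network learns to predict the next character at every position.
     */
    static String[][] examples(String text, int length) {
        int count = (text.length() - 1) / length;
        String[][] result = new String[count][];
        for (int i = 0; i < count; i++) {
            int start = i * length;
            result[i] = new String[] {
                text.substring(start, start + length),
                text.substring(start + 1, start + length + 1)
            };
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "hello world";
        Map<Character, Integer> vocab = buildVocabulary(text);
        System.out.println("vocabulary size: " + vocab.size());
        for (String[] pair : examples(text, 5)) {
            System.out.println("'" + pair[0] + "' -> '" + pair[1] + "'");
        }
    }
}
```

In a real iterator, each character of the input and target strings would be one-hot encoded and packed into the tensors the network consumes.
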
We used [jLibSVM](https://github.com/davidsoergel/jlibsvm) as our SVM implementation, together with [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) for preprocessing. The documentation for jLibSVM is a little sketchy, and Spark has better documentation for its [SVM implementation](http://spark.apache.org/docs/latest/mllib-linear-methods.html). However, the setup required for Spark is overkill for our simple demonstration. We trained an SVM model on our own dataset plus the titles from the [Topic Modeling in Programming Languages dataset](https://github.com/mgree/tmpl/blob/master/www/backend/abstracts/docs.dat).
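
An SVM works on numeric feature vectors, so titles have to be turned into numbers before training. The sketch below shows one common way to do that: bag-of-words counts written in the sparse `label index:value` text format used by LIBSVM. It is an assumption for illustration only. This README does not state which features or labels were actually used, and the class `BagOfWords`, the labels, and the sample titles are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; the feature scheme is an assumption.
class BagOfWords {

    // Maps each token to a 1-based feature index, in order of first appearance.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    /** Returns the feature index of a token, registering the token if it is new. */
    int indexOf(String token) {
        Integer index = vocabulary.get(token);
        if (index == null) {
            index = vocabulary.size() + 1;
            vocabulary.put(token, index);
        }
        return index;
    }

    /** Encodes one title as "label index:count ..." with indices in ascending order. */
    String toLibSvmLine(int label, String title) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : title.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(indexOf(token), 1, Integer::sum);
            }
        }
        StringBuilder line = new StringBuilder();
        line.append(label);
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        BagOfWords bow = new BagOfWords();
        // Labels 1 and -1 are placeholders for two hypothetical classes.
        System.out.println(bow.toLibSvmLine(1, "Deep learning for deep thinkers"));
        System.out.println(bow.toLibSvmLine(-1, "Learning Java"));
    }
}
```

In practice the tokens would first be lemmatized (for example with Stanford CoreNLP) so that inflected forms of a word share one feature index.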