JavaZone 2016 - Machine learning

This repository contains the code used to create the examples for the talk Maskinlæring skriver din neste presentasjon! ("Machine learning writes your next presentation!", held in Norwegian) at JavaZone 2016.

Topic modeling

We used Mallet for topic modeling on the combined summary and description of every presentation. Good starting points for learning more about topic modeling are Mallet Topic Modeling and The Programming Historian's lesson on Topic Modeling and Mallet. We used the example code at Topic Modeling for Java Developers as our starting point.

Vector space models

We used [Deeplearning4j](http://deeplearning4j.org) to generate our vector space models with its word2vec package. We chose the CBOW algorithm rather than Skip-gram because of our small dataset. We also used [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) to preprocess the text documents. In particular, POS tagging and lemmatization are useful preprocessing techniques that come out of the box with Stanford CoreNLP. Useful links to learn more about vector space models:
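To make the difference between CBOW and Skip-gram mentioned above more concrete, here is a minimal plain-Java sketch of how the two algorithms slice the same sentence into training examples. It does not use the Deeplearning4j API; the class name `Word2VecPairs`, the method names, and the sample sentence are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: how CBOW and Skip-gram turn a token sequence into training examples. */
class Word2VecPairs {

    /** CBOW: one example per position; the whole context window jointly predicts the centre word. */
    static List<String> cbow(String[] tokens, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> context = new ArrayList<>();
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) context.add(tokens[j]);
            }
            if (!context.isEmpty()) examples.add(context + " -> " + tokens[i]);
        }
        return examples;
    }

    /** Skip-gram: one example per (centre word, single context word) pair. */
    static List<String> skipGram(String[] tokens, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) examples.add(tokens[i] + " -> " + tokens[j]);
            }
        }
        return examples;
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence, already tokenized and lower-cased.
        String[] tokens = "machine learning writes your next talk".split(" ");
        System.out.println("CBOW examples:");
        cbow(tokens, 2).forEach(System.out::println);
        System.out.println("Skip-gram examples:");
        skipGram(tokens, 2).forEach(System.out::println);
    }
}
```

CBOW yields one example per word position, while Skip-gram yields one example per word-context pair, so the two algorithms see the same corpus in quite different shapes.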

Character sequences with recurrent neural network (RNN)

To generate abstracts based on the abstracts from previous years, we use GravesLSTM from Deeplearning4j. We have two working versions: one using Spark and one using a custom-made CharacterIterator (as in this example from Deeplearning4j). We seem to get the best results with the Spark version, even with the same parameters. Useful links to learn more about RNNs:
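As a rough idea of what a character iterator has to do before an LSTM can train on text, the plain-Java sketch below maps characters to integer indices and cuts the text into fixed-length windows in which each character is paired with the one that follows it. This is not the repository's CharacterIterator and not Deeplearning4j code; the class `CharWindows`, its methods, and the window length are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: preparing character sequences for next-character prediction. */
class CharWindows {

    /** Assigns each distinct character an index in order of first appearance. */
    static Map<Character, Integer> vocabulary(String text) {
        Map<Character, Integer> vocab = new LinkedHashMap<>();
        for (char c : text.toCharArray()) {
            vocab.putIfAbsent(c, vocab.size());
        }
        return vocab;
    }

    /**
     * Cuts the text into windows of length + 1 character indices, advancing by
     * length each time. Within a row, positions 0..length-1 are the inputs and
     * positions 1..length are the next-character labels.
     */
    static List<int[]> windows(String text, Map<Character, Integer> vocab, int length) {
        List<int[]> rows = new ArrayList<>();
        for (int start = 0; start + length + 1 <= text.length(); start += length) {
            int[] row = new int[length + 1];
            for (int k = 0; k <= length; k++) {
                row[k] = vocab.get(text.charAt(start + k));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the concatenated abstracts.
        String text = "machine learning writes your next abstract";
        Map<Character, Integer> vocab = vocabulary(text);
        for (int[] row : windows(text, vocab, 8)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a real network each index would typically be one-hot encoded before being fed to the LSTM; that step is left out here to keep the sketch short.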

Review tool

We used [jLibSVM](https://github.com/davidsoergel/jlibsvm) as our SVM implementation together with [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) for preprocessing. The documentation for jLibSVM is a little sketchy, and Spark has better documentation for its [SVM implementation](http://spark.apache.org/docs/latest/mllib-linear-methods.html). However, the setup required for Spark is overkill for our simple demonstration. We trained an SVM model on our own dataset plus the titles from the [Topic Modeling in Programming Languages dataset](https://github.com/mgree/tmpl/blob/master/www/backend/abstracts/docs.dat).
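For readers who want to see what training an SVM amounts to without setting up any library, here is a minimal plain-Java sketch of a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a swapped-in technique for illustration, not the jLibSVM API and not necessarily the solver it uses; the class `TinyLinearSvm`, the hyperparameters, and the two-dimensional toy data are all assumptions (a real review tool would use feature vectors derived from the preprocessed text).

```java
/** Illustration only: a linear SVM trained by sub-gradient descent on the hinge loss. */
class TinyLinearSvm {
    final double[] weights;
    double bias;

    TinyLinearSvm(int dimensions) {
        weights = new double[dimensions];
    }

    /** Raw decision value w.x + b; its sign is the predicted class. */
    double score(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    int predict(double[] x) {
        return score(x) >= 0 ? 1 : -1;
    }

    /** Labels must be +1 or -1. lambda is the L2 regularization strength. */
    void train(double[][] xs, int[] ys, int epochs, double learningRate, double lambda) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int n = 0; n < xs.length; n++) {
                // A sample contributes to the hinge loss only if it lies inside the margin.
                boolean violated = ys[n] * score(xs[n]) < 1.0;
                for (int i = 0; i < weights.length; i++) {
                    double gradient = lambda * weights[i] - (violated ? ys[n] * xs[n][i] : 0.0);
                    weights[i] -= learningRate * gradient;
                }
                if (violated) {
                    bias += learningRate * ys[n];
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical, linearly separable toy data with labels +1 and -1.
        double[][] xs = {{2, 2}, {3, 3}, {-2, -2}, {-3, -1}};
        int[] ys = {1, 1, -1, -1};
        TinyLinearSvm svm = new TinyLinearSvm(2);
        svm.train(xs, ys, 50, 0.1, 0.01);
        System.out.println("Prediction for (4, 4): " + svm.predict(new double[]{4, 4}));
        System.out.println("Prediction for (-4, -4): " + svm.predict(new double[]{-4, -4}));
    }
}
```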