N-Gram Character Level Language Model

Improving Language Models by Jointly Modelling Language as Distributions over Characters and Bigrams

More information on this project can be found on my website and in the exposé.
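
The title describes the core idea: a single character-level LSTM that jointly predicts the next character and the next character bigram. Below is a minimal, hypothetical PyTorch sketch of that setup; it is not the repository's actual code, and the class, argument, and layer names are illustrative.

```python
import torch.nn as nn

class JointCharBigramLM(nn.Module):
    # Hypothetical sketch: a shared LSTM trunk with two output heads,
    # one over single characters (unigrams) and one over character bigrams.
    def __init__(self, char_vocab, bigram_vocab, emsize=128, nhid=128, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(char_vocab, emsize)
        self.lstm = nn.LSTM(emsize, nhid, nlayers, batch_first=True)
        self.char_head = nn.Linear(nhid, char_vocab)
        self.bigram_head = nn.Linear(nhid, bigram_vocab)

    def forward(self, chars):
        out, _ = self.lstm(self.embed(chars))
        # Both heads share the same hidden states; during training their
        # cross-entropy losses would be summed (or weighted).
        return self.char_head(out), self.bigram_head(out)
```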

Disclaimer

This is a work in progress and is not accepting PRs for now.

Measurements

Benchmarks

| Model | Embedding Size | Hidden Size | BPTT | Batch Size | Epochs | Layers | Dataset | LR | NGram | Test PPL | Test BPC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 1 | 3.76 | 1.91 |
| N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 1 | 3.72 | 1.89 |
| N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 2 | 11.72 | 8.12 |
| N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 2 | 1.96 (only unigrams) | 0.47 (only unigrams) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 512 | 512 | 200 | 50 | 34 | 2 | Wikitext-103 | 20 (1/4 decay) | 2 | 7.96 | 2.98 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 400 | 1840 | 200 | 128 | 23 | 3 | enwik8 | 10 (1/4 decay) | 2 | 1.63 | 0.69 |
| Paper LSTM [1] [2] | 400 | 1840 | 200 | 128 | 50 | 3 | enwik8 | 0.001 (1/10 decay) | 1 | - | 1.232 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 500 | 3 | ptb | 4 (1/4 decay) | 2 | 8.01 | 3.00 |
| N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 500 | 3 | ptb | 4 (1/4 decay) | 2 | 1.60 (only unigrams) | 0.68 (only unigrams, no optimizer) |
| N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 69 | 3 | ptb | 0.002 | | 1.56 (only unigrams) | 0.64 (only unigrams) |
| Paper LSTM [1] [2] | 400 | 1000 | 150 | 128 | 500 | 3 | ptb | 0.001 (1/10 decay) | 1 | | 1.232 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N-Gram CL LSTM | 256 | 1024 | 250 | 100 | 500 | 2 | obw | 0.001 (1/10 decay) | 1 | | |
| N-Gram CL LSTM | 256 | 1024 | 250 | 100 | 500 | 2 | obw | 0.001 (1/10 decay) | 2 | | |
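
A note on reading the two metric columns: assuming Test PPL is the per-character perplexity, Test BPC is simply its base-2 logarithm, since both come from the same cross-entropy. A quick sanity check in Python:

```python
import math

def bpc_from_ppl(ppl: float) -> float:
    # Per-character perplexity and bits-per-character are two views of the
    # same cross-entropy: bpc = log2(ppl), equivalently ppl = 2 ** bpc.
    return math.log2(ppl)

# The CL LSTM row above reports PPL 3.76 and BPC 1.91:
print(round(bpc_from_ppl(3.76), 2))  # 1.91
```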

Model and Corpus train/size metrics

Investigate the dictionary size, the model size, as well as memory usage and training time for different n-gram orders and UNK thresholds (a small dictionary-building sketch follows the table below).

Hyperparameters:

  • Hidden size: 128
  • emsize: 128
  • nlayers: 2
  • batch size: 50
  • lr: 20
| Dataset | UNK threshold | Ngram | sec/epoch | Dictionary size | GPU memory usage | Params |
| --- | --- | --- | --- | --- | --- | --- |
| Wiki-2 | 0 | 1 | 43 | 1156 | 2553 MiB | 561284 |
| Wiki-2 | 5 | 2 | 67 | 2754 | 2817 MiB | 971713 |
| Wiki-2 | 20 | 2 | 53 | 1751 | 2761 MiB | 714199 |
| Wiki-2 | 40 | 2 | 48 | 1444 | 2763 MiB | 635300 |
| Wiki-2 | 40 | 3 | 143 | 9827 | 3301 MiB | 2789721 |
| Wiki-2 | 40 | 4 | 412 | 34124 | 4723 MiB | 9034060 |
| Wiki-2 | 40 | 5 | 782 | 72745 | 6765 MiB | 18959657 |
| Wiki-2 | 100 | 10 | 965 | 90250 | 8635 MiB | 23458442 |
| ptb | 20 | 2 | 31 | 704 | 4893 MiB | 21669504 |
| enwik8 | 20 | 2 | 4493 | | 13 GiB | 120M |
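
As a rough illustration of how the dictionary size grows with the n-gram order and is pruned by the UNK threshold, here is a hypothetical sketch of building a character n-gram dictionary; it is not the repository's actual code, and the function and variable names are made up.

```python
from collections import Counter

def build_char_ngram_dict(text, n, unk_threshold):
    # Count overlapping character n-grams; keep only those seen more often
    # than the threshold and map everything else to a shared <unk> entry.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    vocab = {"<unk>": 0}
    for gram, freq in counts.most_common():
        if freq > unk_threshold:
            vocab[gram] = len(vocab)
    return vocab

# Toy demonstration of the growth with n; on a real corpus a higher UNK
# threshold prunes rare n-grams and shrinks the dictionary again.
toy = "the quick brown fox jumps over the lazy dog " * 100
for n in (1, 2, 3):
    print(f"n={n}: {len(build_char_ngram_dict(toy, n, unk_threshold=0))} entries")
```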

Plot for Wikitext-2

(Figure: memory usage and model size)

TODO:

  • Remove the Corpus from generate.py
  • Make the C++ extension work
  • Build a flair language model
  • Compare epoch time between the bigram and unigram models

Footnotes

  1. https://arxiv.org/pdf/1803.08240.pdf

  2. https://github.com/salesforce/awd-lstm-lm