/Text-Normalization-Demo

Demonstration of the results in "Text Normalization using Memory Augmented Neural Networks", Authors: Subhojeet Pramanik, Aman Hussain


Text Normalization using Memory Augmented Neural Networks

The Text Normalization Demo notebook and the accompanying paper "Text Normalization using Memory Augmented Neural Networks" demonstrate an accuracy of 99.5% on the Text Normalization Challenge by Richard Sproat and Navdeep Jaitly. An earlier version of this approach secured 6th place in the Kaggle Russian Text Normalization Challenge by Google's Text Normalization Research Group.

Go straight to the Text Normalization Demo Notebook

Architecture

Two models are used for text normalization. An XGBoost model first classifies each token as to-be-normalized or remain-self. The to-be-normalized tokens are then fed character by character to our proposed Sequence to Sequence DNC model.
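
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `needs_normalization` and `normalize_chars` are hypothetical stand-ins for the XGBoost classifier and the Sequence to Sequence DNC model, and the toy rules below exist only so the example runs on its own.

```python
from typing import Callable, Iterable, List


def normalize_tokens(
    tokens: Iterable[str],
    needs_normalization: Callable[[str], bool],
    normalize_chars: Callable[[List[str]], str],
) -> List[str]:
    """Route each token through the two-stage pipeline."""
    output = []
    for token in tokens:
        if needs_normalization(token):
            # Stage 2: the token is handed over character by character.
            output.append(normalize_chars(list(token)))
        else:
            # remain-self: the token is copied through unchanged.
            output.append(token)
    return output


# Toy stand-ins (assumptions for illustration only).
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def toy_classifier(token: str) -> bool:
    # Hypothetical rule: only tokens made entirely of ASCII digits are normalized.
    return bool(token) and all(c in DIGIT_WORDS for c in token)


def toy_normalizer(chars: List[str]) -> str:
    # Hypothetical rule: read the digits out one by one.
    return " ".join(DIGIT_WORDS[c] for c in chars)


result = normalize_tokens(["Room", "42", "is", "open", "."],
                          toy_classifier, toy_normalizer)
print(result)
```

Keeping the classifier and the normalizer as separate callables mirrors the split in the text: most tokens are remain-self and never reach the more expensive second stage.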

More details about the architecture and implementation can be found in the original paper.

Sequence to sequence DNC


Results

1. Normalizing English Data

Semiotic Classwise Accuracy

semiotic-class  accuracy            count  correct
ALL             0.994267233453397   92451  91921
ADDRESS         1.0                 4      4
CARDINAL        0.9942140790742526  1037   1031
DATE            0.9971751412429378  2832   2824
DECIMAL         0.9891304347826086  92     91
DIGIT           0.7954545454545454  44     35
ELECTRONIC      0.7346938775510204  49     36
FRACTION        0.6875              16     11
LETTERS         0.971611071682044   1409   1369
MEASURE         0.971830985915493   142    138
MONEY           0.972972972972973   37     36
ORDINAL         0.9805825242718447  103    101
PLAIN           0.9939611747724394  67894  67484
PUNCT           0.9988729854615125  17746  17726
TELEPHONE       0.918918918918919   37     34
TIME            0.75                8      6
VERBATIM        0.994005994005994   1001   995
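
For reference, the accuracy column is the ratio of the correct column to the count column. The short sketch below recomputes it for a few rows copied from the English table; the `accuracy` helper and the `english` dictionary are names chosen for this illustration, not part of the notebook.

```python
# (count, correct) pairs copied from the English results table.
english = {
    "ALL": (92451, 91921),
    "DIGIT": (44, 35),
    "FRACTION": (16, 11),
    "TIME": (8, 6),
}


def accuracy(count: int, correct: int) -> float:
    """Fraction of tokens in a semiotic class that were normalized correctly."""
    return correct / count


for name, (count, correct) in english.items():
    print(f"{name:<10} {accuracy(count, correct):.4f}")
```

Classes with very few tokens, such as TIME with 8, move by more than ten percentage points per error, so their accuracies are best read alongside the count.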

2. Normalizing Russian Data

Semiotic Classwise Accuracy

semiotic-class  accuracy            count  correct
ALL             0.9928752306965964  93196  92532
CARDINAL        0.9417922948073701  2388   2249
DATE            0.9732441471571907  1495   1455
DECIMAL         0.9                 60     54
DIGIT           1.0                 16     16
ELECTRONIC      0.6041666666666666  48     29
FRACTION        0.6086956521739131  23     14
LETTERS         0.9907608695652174  1840   1823
MEASURE         0.8978102189781022  411    369
MONEY           0.8947368421052632  19     17
ORDINAL         0.9461358313817331  427    404
PLAIN           0.994688407139769   64764  64420
PUNCT           0.9998519542045006  20264  20261
TELEPHONE       0.8202247191011236  89     73
TIME            0.75                8      6
VERBATIM        0.9985119047619048  1344   1342
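
One way to read the Russian table is by absolute error count (count minus correct) rather than by accuracy. The sketch below does that with the figures copied from the table and checks that the per-class errors add up to the ALL row; the variable names are chosen for this illustration only.

```python
# (count, correct) pairs copied from the Russian results table, excluding ALL.
russian = {
    "CARDINAL": (2388, 2249), "DATE": (1495, 1455), "DECIMAL": (60, 54),
    "DIGIT": (16, 16), "ELECTRONIC": (48, 29), "FRACTION": (23, 14),
    "LETTERS": (1840, 1823), "MEASURE": (411, 369), "MONEY": (19, 17),
    "ORDINAL": (427, 404), "PLAIN": (64764, 64420), "PUNCT": (20264, 20261),
    "TELEPHONE": (89, 73), "TIME": (8, 6), "VERBATIM": (1344, 1342),
}
all_count, all_correct = 93196, 92532  # the ALL row

# Errors per class, largest first.
errors = {name: count - correct for name, (count, correct) in russian.items()}
ranked = sorted(errors.items(), key=lambda item: item[1], reverse=True)

# The per-class rows should account for every token and error in the ALL row.
total_count = sum(count for count, _ in russian.values())
total_errors = sum(errors.values())
assert total_count == all_count
assert total_errors == all_count - all_correct

for name, n_errors in ranked[:5]:
    print(f"{name:<10} {n_errors:>4} errors ({n_errors / total_errors:.1%} of all errors)")
```

Sorted this way, PLAIN and CARDINAL account for most of the misnormalized tokens, even though classes such as ELECTRONIC and FRACTION have much lower accuracy.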

How to run?

Requirements:

Follow these steps for a demonstration:

  1. Clone the repo
  2. Download and extract the required data.
     $ sh setup.sh
  3. Create & activate an environment using the provided file
     $ conda env create -f environment.yml
     $ source activate deep-tf
  4. Start a Jupyter Notebook server
  5. Open 'notebooks/Text Normalization Demo.ipynb'
  6. Set the language to English or Russian below the 'Global Config' cell
     lang = 'english'
     # lang = 'russian'
  7. Run the notebook

Full Requirements:

  • numpy 1.13.3
  • pandas 0.21.0
  • matplotlib 2.1.0
  • watermark 1.5.0
  • seaborn 0.8.1
  • sklearn 0.19.1
  • xgboost 0.6
  • tensorflow 1.3.0
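
To confirm that an environment matches the pinned versions listed above, a small check along these lines may help. It is a sketch, not part of the repository: it assumes Python 3.8+ for `importlib.metadata` (older interpreters would need the `importlib_metadata` backport), it uses the distribution name `scikit-learn` for sklearn, and `find_mismatches` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+

# Pinned versions copied from the requirements list. "scikit-learn" is the
# distribution name assumed here for the sklearn entry.
PINNED = {
    "numpy": "1.13.3",
    "pandas": "0.21.0",
    "matplotlib": "2.1.0",
    "watermark": "1.5.0",
    "seaborn": "0.8.1",
    "scikit-learn": "0.19.1",
    "xgboost": "0.6",
    "tensorflow": "1.3.0",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a pre-release or patch variant of a pinned version is reported as a mismatch; whether that matters for the notebook is not something this check can tell.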

Authors

  1. Subhojeet Pramanik (http://github.com/subho406)
  2. Aman Hussain (https://github.com/AmanDaVinci)

Acknowledgements

Differentiable Neural Computer, TensorFlow implementation: https://github.com/deepmind/dnc