A simple address parser for New Zealand (LINZ) addresses.

LINZ Address Parser

This repo contains code for training a bidirectional LSTM model on LINZ addresses; the trained model can then be used to parse address strings into their constituent components. We treat an address string as a sequence of characters, and we seek to assign a label to each one.

For example, in the address 1/45A Memorial Avenue, Ilam, Christchurch 8053, the road name is just the ordered sequence of characters that have been assigned the label road_name: ['M', 'e', 'm', 'o', 'r', 'i', 'a', 'l'].
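
To make the labelling idea concrete, here is a minimal sketch in plain Scala of how per-character labels can be collected back into components. The label names address_number, address_number_suffix, road_name, and road_type_name are taken from the parser output shown later in this document; the separator label, the collect function, and the hand-written labels are assumptions for illustration, not the library's actual implementation.

```scala
object CharLabelSketch {
  // Collect the characters assigned to each label, preserving their order
  // in the original string. The "separator" label is hypothetical.
  def collect(chars: Seq[Char], labels: Seq[String]): Map[String, String] = {
    require(chars.length == labels.length, "one label per character")
    chars.zip(labels)
      .filter { case (_, label) => label != "separator" }
      .groupBy { case (_, label) => label }
      .map { case (label, pairs) => label -> pairs.map(_._1).mkString }
  }

  val address = "45A Memorial Avenue"

  // Hand-written labels, one per character; a trained model would predict these.
  val labels: Seq[String] =
    Seq.fill(2)("address_number") ++
      Seq("address_number_suffix") ++
      Seq("separator") ++
      Seq.fill(8)("road_name") ++
      Seq("separator") ++
      Seq.fill(6)("road_type_name")

  val parsed: Map[String, String] = collect(address.toSeq, labels)
}
```

Here CharLabelSketch.parsed("road_name") is the string built from the eight characters labelled road_name.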

Note that the method is similar to AddressNet for Australian addresses.

Getting Started

The library is provided as an sbt project. A fat jar can be made simply by running:

sbt assembly

The library uses deeplearning4j (DL4J), which is a relatively large library, so a fat jar that includes it will also be large, at about 1.4GB. Training a model also runs much faster with CUDA. To use CUDA, make sure the file ./project/Dependencies.scala looks as follows:

import sbt._

object Dependencies {
  /*
  lazy val dl4jcore = "org.deeplearning4j" % "deeplearning4j-core" % "1.0.0-M1.1"
  lazy val nd4j = "org.nd4j" % "nd4j-native-platform" % "1.0.0-M1.1"
  */
  /**/
  lazy val dl4jcore = "org.deeplearning4j" % "deeplearning4j-cuda-11.2" % "1.0.0-M1.1"
  lazy val nd4j = "org.nd4j" % "nd4j-cuda-11.2-platform" % "1.0.0-M1.1"
  /**/

  ...

}

It's up to you to ensure a compatible CUDA installation is available. That said, I found NVIDIA's CUDA toolkit and Docker images the easiest path, and a simple Docker setup is provided for those interested. Once a model is trained, and all you wish to do is parse individual addresses, the CPU version will suffice. In that case, just ensure ./project/Dependencies.scala looks as follows:

object Dependencies {
  /**/
  lazy val dl4jcore = "org.deeplearning4j" % "deeplearning4j-core" % "1.0.0-M1.1"
  lazy val nd4j = "org.nd4j" % "nd4j-native-platform" % "1.0.0-M1.1"
  /**/
  /*
  lazy val dl4jcore = "org.deeplearning4j" % "deeplearning4j-cuda-11.2" % "1.0.0-M1.1"
  lazy val nd4j = "org.nd4j" % "nd4j-cuda-11.2-platform" % "1.0.0-M1.1"
  */

  ...

}
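
Rather than commenting blocks in and out by hand, the switch could be driven by a JVM system property. The following is a sketch only, not how the repository is set up: the property name cuda and the useCuda helper are hypothetical, and the remaining contents of Dependencies (elided above) would need to be carried over unchanged.

```scala
import sbt._

object Dependencies {
  private val dl4jVersion = "1.0.0-M1.1"

  // Hypothetical switch: start sbt with -Dcuda=true to select the CUDA artifacts.
  private val useCuda: Boolean = sys.props.get("cuda").contains("true")

  lazy val dl4jcore =
    if (useCuda) "org.deeplearning4j" % "deeplearning4j-cuda-11.2" % dl4jVersion
    else "org.deeplearning4j" % "deeplearning4j-core" % dl4jVersion

  lazy val nd4j =
    if (useCuda) "org.nd4j" % "nd4j-cuda-11.2-platform" % dl4jVersion
    else "org.nd4j" % "nd4j-native-platform" % dl4jVersion
}
```

With something like this in place, sbt -Dcuda=true assembly would build against CUDA, and plain sbt assembly would build the CPU version.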

To train a model, it's easiest to use the sbt console:

$ sbt console

Then, in the console:

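import java.io.File
import org.cmhh.linzaddressparse._

// Build the network from the vocabulary size and the number of labels.
val m = model.lstm(Vocab.size, Labels.size)
// Iterator over the training data.
val it = AddressDataSetIterator.train
m.fit(it)
// Save the trained model to disk.
m.save(new File("model.mdl"))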
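
The call model.lstm(Vocab.size, Labels.size) suggests that each character is encoded against a fixed vocabulary before being fed to the network. One common way to do this is one-hot encoding, sketched below in plain Scala. The vocabulary, the extra slot for unknown characters, and the lowercasing step are all assumptions here; the library's own Vocab object may be defined quite differently.

```scala
object OneHotSketch {
  // Hypothetical vocabulary: lowercase letters, digits, and a few punctuation marks.
  val vocab: Vector[Char] =
    (('a' to 'z') ++ ('0' to '9') ++ Seq(' ', '/', ',', '-')).toVector

  private val index: Map[Char, Int] = vocab.zipWithIndex.toMap

  // Extra slot, after the known characters, for anything outside the vocabulary.
  val unknown: Int = vocab.size
  val size: Int = vocab.size + 1

  // Encode a string as one one-hot vector of length `size` per character.
  def encode(s: String): Vector[Vector[Double]] =
    s.toLowerCase.toVector.map { c =>
      val i = index.getOrElse(c, unknown)
      Vector.tabulate(size)(j => if (j == i) 1.0 else 0.0)
    }
}
```

Each address then becomes a sequence of vectors, one per character, which is the shape of input a recurrent network consumes.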

The trained model can then be used as follows:

utils.parse("1/45A Memorial Avenue, Ilam, Christchurch 8053")(m).toJson
{
  "unit_type":null,
  "unit_value":"1",
  "level_type":null,
  "level_value":null,
  "address_number":"45",
  "address_number_suffix":"A",
  "address_number_high":null,
  "road_name":"Memorial",
  "road_type_name":"Avenue",
  "road_suffix":null,
  "suburb_locality":"Ilam",
  "town_city":"Christchurch",
  "postcode":"8053"
}
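
A sequence-labelling network of this kind typically emits, for each character, a probability for every label, and the parse is obtained by picking the most probable label per character. The sketch below shows that decoding step in plain Scala. It is an assumption about how utils.parse might work internally rather than a description of it, and the three-label subset and the probabilities are made up for illustration.

```scala
object ArgmaxSketch {
  // Hypothetical subset of labels, in the order of the network's outputs.
  val labels: Vector[String] = Vector("address_number", "separator", "road_name")

  // For each character, return the label with the highest probability.
  def decode(probs: Seq[Seq[Double]]): Seq[String] =
    probs.map { row =>
      require(row.length == labels.length, "one probability per label")
      labels(row.indexOf(row.max))
    }

  // Made-up probabilities for the three characters of "5 M".
  val example: Seq[String] = decode(Seq(
    Seq(0.90, 0.05, 0.05),
    Seq(0.10, 0.80, 0.10),
    Seq(0.05, 0.15, 0.80)
  ))
}
```

The resulting label sequence can then be grouped into fields such as road_name to produce output like the JSON above.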

A pre-trained model is provided as src/main/resources/model.mdl, as is a very basic entrypoint which can make use of it:

java -cp target/scala-2.13/linzaddressparse.jar \
  org.cmhh.linzaddressparse.AddressParse \
  "1/45A Memorial Avenue, Ilam, Christchurch 8053"
{
  "unit_type":null,
  "unit_value":"1",
  "level_type":null,
  "level_value":null,
  "address_number":"45",
  "address_number_suffix":"A",
  "address_number_high":null,
  "road_name":"Memorial",
  "road_type_name":"Avenue",
  "road_suffix":null,
  "suburb_locality":"Ilam",
  "town_city":"Christchurch",
  "postcode":"8053"
}