Cross-gazetteer record linking of natural features in Switzerland using machine learning (random forests) and handcrafted rules.

machine-learning-gazetteer-matching

This repo contains the code, annotated data, and results for the IJGIS paper "Machine learning for cross-gazetteer matching of natural features".

Notebooks

The Jupyter notebooks are at the top level of this repo. They are numbered in the order in which they should be run and organized into three numbered subsets:

  • 0_ : (00, 01, 02): preparation, preprocessing
  • 1_ : (10, 11, 12, 13, 14): rule-based matching
  • 2_ : (20, 21): machine learning based matching using random forests

Note that these notebooks rely heavily on the code in the gazmatch folder.
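The numbering scheme above implies a run order. Below is a minimal sketch of how that order could be recovered from file names, assuming each notebook's name starts with its two-digit number. The function names and the file names in the comments are hypothetical and not taken from this repo.

```python
import re

# Subset labels taken from the list above, keyed by the first digit.
SUBSETS = {
    "0": "preparation, preprocessing",
    "1": "rule-based matching",
    "2": "machine learning based matching",
}

def sort_notebooks(names):
    """Return notebook file names sorted by their leading two-digit number.

    Assumption: names start with the number (00, 01, ..., 21).
    Names without such a prefix are dropped.
    """
    numbered = []
    for name in names:
        match = re.match(r"(\d{2})", name)
        if match and name.endswith(".ipynb"):
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]

def subset_of(name):
    """Map a numbered notebook name to the subset it belongs to."""
    return SUBSETS[name[0]]

# Usage from the repo root (file names depend on the actual checkout):
#   from pathlib import Path
#   ordered = sort_notebooks(p.name for p in Path(".").glob("*.ipynb"))
```

Each notebook in the resulting list could then be opened in Jupyter, or executed non-interactively with `jupyter nbconvert --to notebook --execute <name>`, provided the gazmatch folder is importable from the working directory.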

Data

In /data/, we share our annotated data (annotated_sample.csv) as well as some serialized files, including test_set_ids.pkl, which identifies the feature-type-balanced test set used in a subset of experiments. The latest GeoNames and SwissNames3D data can be obtained online.

Note that these datasets will not be identical to the ones used in the paper, which were downloaded in 2017. In particular, newer versions of SwissNames3D may change the UUIDs of certain records. Data preparation involving the raw datasets is described and performed in the preparation notebooks. Contact the first author of the associated paper with any data requests.
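For working with the shared files outside the notebooks, here is a minimal sketch using only the Python standard library. It assumes that test_set_ids.pkl unpickles to a collection of record ids and that annotated_sample.csv has a header row. The `id_column` parameter and the helper names are hypothetical; check the CSV header and the notebooks for the real column names and loading code.

```python
import csv
import pickle

def read_rows(lines):
    """Parse CSV text (an open file or any iterable of lines) into dicts."""
    return list(csv.DictReader(lines))

def load_test_set_ids(path):
    """Unpickle the test-set ids.

    Assumption: the file holds an iterable of ids. Only unpickle files
    you trust, since pickle can execute arbitrary code on load.
    """
    with open(path, "rb") as handle:
        return set(pickle.load(handle))

def split_by_test_ids(rows, test_ids, id_column):
    """Split rows into (rest, test) by whether row[id_column] is a test id.

    `id_column` is a hypothetical parameter, not a name from the repo.
    """
    test_ids = set(test_ids)
    test = [row for row in rows if row[id_column] in test_ids]
    rest = [row for row in rows if row[id_column] not in test_ids]
    return rest, test

# Usage from the repo root (paths as described above):
#   with open("data/annotated_sample.csv", newline="") as f:
#       rows = read_rows(f)
#   ids = load_test_set_ids("data/test_set_ids.pkl")
#   rest, test = split_by_test_ids(rows, ids, id_column="<id column name>")
```

The csv module reads every value as a string, so if the pickled ids turn out to be of another type they would need converting before the membership test.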

Results

The /results/ folder contains the TSV files used to plot the results in the paper. The /html_exports/ folder contains HTML exports of all the notebooks for easy viewing in a browser.