
Importer for Who's on First gazetteer


This repository is part of the Pelias project. Pelias is an open-source, open-data geocoder originally sponsored by Mapzen. Our official user documentation is here.

Pelias Who's on First Data Importer


Overview

pelias-whosonfirst imports data from the Who's on First project, read from local files, into a Pelias Elasticsearch store.

Requirements

Node.js is required.

See Pelias software requirements for required and recommended versions.

Quickstart Usage

To install the required Node.js module dependencies, download data for the entire planet (20GB+) and execute the importer, run:

npm install
npm run download
npm start

Configuration

This importer is configured using the pelias-config module. The following configuration options are supported by this importer.

imports.whosonfirst.datapath

  • Required: yes
  • Default: ``

Full path to the directory where Who's on First data is located. The included downloader script automatically places WOF data here and is the recommended way to obtain it.

imports.whosonfirst.importPlace

  • Required: no
  • Default: ``

Set to a WOF ID or array of IDs to import data only for descendants of those records, rather than the entire planet.

You can use the Who's on First Spelunker or the source_id field from any WOF result of a Pelias query to determine these values.

Specifying a value for importPlace will download the full planet SQLite database (27GB). Support for individual country downloads may be added in the future.

imports.whosonfirst.importVenues

  • Required: no
  • Default: false

Set to true to enable importing venue records. There are over 15 million venues, so this option substantially increases download size and disk usage.

It is currently not recommended to import venues.

imports.whosonfirst.importPostalcodes

  • Required: no
  • Default: false

Set to true to enable importing postalcode records. There are over 3 million postal code records.

Setting this option to true is well tested and may become the default in the future.

imports.whosonfirst.missingFilesAreFatal

  • Required: no
  • Default: false

Set to true to stop the import process when files are missing from Who's on First bundles.

This flag is useful if you consider it vital that all Who's on First data is successfully imported, and can be helpful to guard against incomplete downloads or other types of failure.

imports.whosonfirst.maxDownloads

  • Required: no
  • Default: 4

The maximum number of files to download simultaneously. Higher values can be faster, but can also cause download errors.

imports.whosonfirst.dataHost

  • Required: no
  • Default: https://dist.whosonfirst.org/

The location to download Who's on First data from. Changing this can be useful to use custom data, pin data to a specific date, etc.

imports.whosonfirst.sqlite

  • Required: no
  • Default: false

Set to true to use Who's on First SQLite databases instead of GeoJSON bundles.

SQLite databases take up less space on disk and can be much more efficient to download and extract.

This option may become the default in the near future.

However, both the Who's on First processes that generate these files and the Pelias code that uses them are new and not yet considered production ready.
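As an illustration of how the options above fit together, here is a minimal sketch that builds the imports.whosonfirst section as a JavaScript object and serializes it to the JSON text that would go into the Pelias config file (typically pelias.json). The datapath and the WOF ID are placeholders chosen for the example, not recommended values. Look up real IDs as described under importPlace.

```javascript
// Sketch of the `imports.whosonfirst` section of a Pelias config file.
// The datapath and WOF ID below are placeholders, not recommended values.
const config = {
  imports: {
    whosonfirst: {
      datapath: '/data/whosonfirst', // placeholder path
      importPlace: [85633793],       // placeholder WOF ID; omit to import the entire planet
      importVenues: false,           // default; venues are currently not recommended
      importPostalcodes: true,
      missingFilesAreFatal: true,
      maxDownloads: 4                // default
    }
  }
};

// Serialize to the JSON text that would be saved in the config file.
const configJson = JSON.stringify(config, null, 2);
console.log(configJson);
```

Only datapath is required. Every other key can be left out to fall back to the defaults listed above.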

Downloading the Data

The download script will download the required bundles or SQLite databases into the directory configured in imports.whosonfirst.datapath.

To install the required Node.js module dependencies and run the download script:

npm install
npm run download

## or

npm run download -- --admin-only # to only download hierarchy data, without venues or postalcodes

Note: The download script will always download data for the entire planet. Support for downloading data for specific countries is a possible future enhancement.

When using imports.whosonfirst.importPlace, a new SQLite database will only be downloaded if new data is available. Otherwise, the existing download will be reused.

Warning: Who's on First data is big. Just the hierarchy data is tens of GB, and the full dataset is over 100GB on disk. Additionally, Who's on First uses one file per record. In addition to lots of disk space, you need lots of free inodes. On Linux/Mac, df -ih can show you how many free inodes you have.

Expect to use a few million inodes for Who's on First. You probably don't want to store multiple copies of the Who's on First data due to its disk requirements.
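If you would rather check disk space and inodes from Node.js than from df, the sketch below shows one way to do it. It assumes a Node.js release that provides fs.statfsSync (older releases do not, so the call is guarded). The function names and default thresholds are hypothetical, loosely based on the figures above, and are not part of this importer.

```javascript
const fs = require('fs');

// Pure helper: given statfs-style numbers, report whether a volume has room.
// The default thresholds are assumptions, not values used by the importer.
function hasRoomForWof(stats, minFreeInodes = 5e6, minFreeBytes = 100 * 1024 ** 3) {
  const freeBytes = stats.bavail * stats.bsize; // space available to unprivileged users
  return {
    freeInodes: stats.ffree,
    freeBytes,
    ok: stats.ffree >= minFreeInodes && freeBytes >= minFreeBytes
  };
}

// fs.statfsSync only exists in newer Node.js releases, so guard the call.
function checkVolume(dir) {
  if (typeof fs.statfsSync !== 'function') {
    return null; // fall back to `df -ih` on older Node.js versions
  }
  return hasRoomForWof(fs.statfsSync(dir));
}

console.log(checkVolume('.'));
```

Some filesystems do not report a meaningful free inode count, so treat a failing result as a prompt to double-check with df rather than as a definitive answer.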

Types

There are two major categories of Who's on First data supported: hierarchy (or admin) data, and venues.

Hierarchy data represents things like cities, countries, counties, boroughs, etc.

Venues represent individual places like the Statue of Liberty, a gas station, etc. Venues are subdivided by country, and sometimes regions within a country.

Currently, the supported hierarchy types are:

  • borough
  • continent
  • country
  • county
  • dependency
  • disputed
  • empire
  • localadmin
  • locality
  • macrocounty
  • macrohood
  • macroregion
  • marinearea
  • neighbourhood
  • ocean
  • region
  • postalcodes (optional, see configuration)

Other types may be included in the future.

The Who's on First documentation has a description of all the types supported by Who's on First.

In Other Projects

This project exposes a number of Node.js streams for dealing with Who's on First data and metadata files:

  • metadataStream: streams rows from a Who's on First metadata file
  • parseMetaFiles: CSV parse stream configured for metadata file contents
  • loadJSON: parallel stream that asynchronously loads GeoJSON files
  • recordHasIdAndProperties: rejects Who's on First records missing id or properties
  • isActiveRecord: rejects records that are superseded, deprecated, or otherwise inactive
  • isNotNullIslandRelated: rejects Null Island and other records that intersect it (currently just postal codes at 0/0)
  • recordHasName: rejects records without names
  • conformsTo: filter Who's on First records on a predicate (see lodash's conformsTo for more information)
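To give a feel for the kind of filtering the streams above perform, here is a self-contained sketch that uses only Node.js core modules. It does not call this module's API: filterStream, hasIdAndProperties and hasName are hypothetical stand-ins written for illustration, and the choice of the wof:name property as the record name is an assumption. Consult the module's source for the actual exports and their signatures.

```javascript
const { Transform, Readable } = require('stream');

// Build an object-mode stream that passes through only records matching `predicate`.
function filterStream(predicate) {
  return new Transform({
    objectMode: true,
    transform(record, _encoding, callback) {
      if (predicate(record)) {
        callback(null, record); // keep the record
      } else {
        callback();             // drop the record
      }
    }
  });
}

// Hypothetical predicates, loosely modelled on the filters listed above.
const hasIdAndProperties = (record) =>
  record != null && record.id !== undefined && record.properties != null;

const hasName = (record) => {
  const name = record.properties['wof:name']; // assumed name property
  return typeof name === 'string' && name.trim().length > 0;
};

// Made-up sample records for the example.
const records = [
  { id: 1, properties: { 'wof:name': 'Example City' } },
  { id: 2, properties: {} },
  { properties: { 'wof:name': 'No ID' } }
];

Readable.from(records)
  .pipe(filterStream(hasIdAndProperties))
  .pipe(filterStream(hasName))
  .on('data', (record) => console.log(record.id));
```

Chaining small single-purpose filters like this keeps each rule easy to test on its own. Here only the first sample record passes both filters.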