pelias/whosonfirst

Import Who's on First venues

orangejulius opened this issue · 6 comments

Who's on First now includes many venues. The data is split across several hundred repos in the whosonfirst-data GitHub organization, so a big challenge will simply be gathering all the data. Several of the repositories use git-lfs as well.

On the importer side, we are currently able to squeeze all the WOF administrative area records into memory, which obviously won't work with millions of venues.
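One way to keep memory bounded is to handle records one at a time instead of collecting them all first. A minimal sketch, assuming records arrive as an iterable of JSON strings; `parseRecords` and `importAll` are hypothetical names for illustration and are not part of the importer:

```javascript
// Hypothetical sketch: lazily parse records so only one is held at a time.
function* parseRecords(lines) {
  for (const line of lines) {
    if (line.trim() === '') continue; // skip blank lines
    yield JSON.parse(line);
  }
}

// Consume records one by one; `handle` stands in for whatever indexes a record.
function importAll(lines, handle) {
  let count = 0;
  for (const record of parseRecords(lines)) {
    handle(record);
    count += 1;
  }
  return count;
}
```

Because the generator yields each record as it is parsed, peak memory depends on the size of a single record rather than on the total number of venues.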

has to be done to allow for dev work

  • create script to set up test data directory with example venue data (no longer needed, because the venue bundles are published and can be downloaded directly)
  • publish WIP code to support better memory management in importer (#119, related to #7)

has to be done before production readiness

  • improve WOF venue generation code to build more venue bundles (preferably we get one giant venue bundle) (this has been done by the Who's on First team!)
  • change list of WOF layers in API so venue and address records in WOF can be queried (similar to pelias/api#569) (fixed in pelias/api#645)
  • ensure admin lookup code doesn't try to load all the venues if it's pointed at a directory containing WOF admin and venue data (update: it doesn't, because like the wof importer, wof-pip-service uses the meta files to know what to load)
  • (Mapzen Search specific) update chef scripts to download venue bundles OR update chef scripts to use the JavaScript downloader, and update that to include venue data (preferred). For now, this can just be a big hardcoded list
  • Update whosonfirst repo readme with venue information
  • Introduce configuration option to disable importing venues (#142)
  • review any acceptance test failures with a full planet build and resolve them to our satisfaction
  • Ensure performance is reasonable, since this may add 10-15 million new records!
  • Update the installation docs and data sources docs

can be done as follow up improvements

  • Rewrite downloader script (again) to be faster and smarter about downloading venue data (#135)
  • write code that allows us to handle street addresses in WOF records. This can mirror the address-duplicating code in the OSM importer
  • import category tags and normalize them to the common taxonomy (this still needs some definition). The category info will live in https://github.com/pelias/categories soon

I was poking around in the venue data recently and noticed that there are some Manhattan records with multiple hierarchies that are also placed in New Jersey.

Are there enough that reporting them and fixing them manually(-ish) would be difficult?

I found 4090 just in that area but am working on a script to check elsewhere.
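A check of this kind could look roughly like the sketch below. It is not the script mentioned above; it assumes each record is a parsed WOF GeoJSON feature whose `wof:hierarchy` property is an array of objects carrying a `region_id`, and the function names are hypothetical:

```javascript
// Hypothetical helper: collect the distinct region ids across a record's hierarchies.
function regionIds(record) {
  const hierarchies = (record.properties && record.properties['wof:hierarchy']) || [];
  const ids = new Set();
  for (const h of hierarchies) {
    if (h.region_id !== undefined) ids.add(h.region_id);
  }
  return [...ids];
}

// Flag a record for manual review if its hierarchies place it in more than one region.
function spansMultipleRegions(record) {
  return regionIds(record).length > 1;
}
```

Running `spansMultipleRegions` over every venue record would give a count comparable to the figure above for any area, not just Manhattan.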

Taking a look at the acceptance tests, there are 5 different issues happening. You can compare against dev2 as of this writing (October 13, 2016) to see the difference.

Daly City

I believe this is a variant on the issue where we almost never return admin areas for autocomplete queries with a focus. There were already venues being returned ahead of Daly City; now there are just more.

4th and King

There's a new entry for the 4th and King transit station in SF. This one is probably ok.

Newfoundland and Labrador

[screenshot from 2016-10-13 19-21-26]

The scores for the venues that start with "Newfoundland and Labrador" are actually identical to the score for the region. Perhaps we should apply a small boost to all admin areas? Even a 1.1x boost here would be enough. I'll investigate later.
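To illustrate the arithmetic only: with tied scores, any multiplier above 1 on admin areas is enough to break the tie. The sketch below rescores a result list after the fact; the layer names, the `rescore` function, and the result shape are assumptions for illustration, and a real boost would more likely live in the search query itself:

```javascript
// Hypothetical set of layers treated as administrative areas.
const ADMIN_LAYERS = new Set(['country', 'region', 'county', 'locality']);

// Multiply the score of admin-area results by `boost`, then sort by score, highest first.
function rescore(results, boost = 1.1) {
  return results
    .map((r) => ({ ...r, score: ADMIN_LAYERS.has(r.layer) ? r.score * boost : r.score }))
    .sort((a, b) => b.score - a.score);
}
```

Given a venue and a region that both score 10, the region moves to the top after rescoring while the venue's score is unchanged.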

Maui, Hawaii

[screenshot from 2016-10-13 19-31-27]

This actually has nothing to do with the duplicate Maui. It appears to happen simply because "Maui Maui" is shorter than "Maui County", so its relevance score is higher. Other "Maui XXXX" results show up with a tied score. Here the score is significantly higher for "Maui Maui", so I don't think we can boost our way out of it. One solution might be to add "Maui" as an alt name for the county, but this would mean we can't fix it until next quarter.

New South Wales

We already return the Geonames record for New South Wales first, but it has the name "State of New South Wales". It's boosted by the population, but the WOF record (name: "New South Wales") has no population info. I think this one is ok, and additionally we can and should add the population data to WOF.

Summary

Other than Maui Maui, most of these are easily fixable.

I have received word from the WOF team that WOF Venues are pretty low priority for them, as there's lots of other work to be done. At this time enabling venue imports should still be as easy as toggling a config flag (importVenues in pelias.json). We welcome reports of how well this works out for people, but don't intend to support it as a production-ready configuration any time soon.
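As a rough illustration of the toggle, here is the relevant part of a config expressed as a JavaScript object. Only the `importVenues` name comes from the comment above; the nesting under `imports.whosonfirst`, the `shouldImportVenues` helper, and the choice to treat a missing flag as disabled are assumptions:

```javascript
// Assumed shape of the relevant part of pelias.json (nesting is an assumption).
const config = {
  imports: {
    whosonfirst: {
      importVenues: true
    }
  }
};

// Hypothetical helper: venues are imported only when the flag is explicitly true.
function shouldImportVenues(cfg) {
  const wof = (cfg.imports && cfg.imports.whosonfirst) || {};
  return wof.importVenues === true;
}
```

Requiring an explicit `true` keeps venue imports opt-in, which matches the intent of not treating them as a production-ready default.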

After some recent discussion it sounds like we have no plans to continue supporting WOF venue downloads going forward. The new data hosting for Who's on First sponsored by Geocode Earth is not going to publish them, and we expect to remove support for this functionality in this importer.