openvenues/lieu

Difficulties running dedupe_geojson out of the box and a possibly significant typo?


Hi. Interesting package that does almost exactly what I need, but I've had some difficulty getting it to run in Python 3 (3.8, specifically). I spent a day hacking various bits and pieces and seem to have certain slices running, and I'm happily running libpostal/pypostal in other contexts (great stuff, thanks!). I imagine I have some sort of installation/package dependency issue, but I also wonder if some sort of commit/update may have failed somewhere. For example:

class GeoJSONLineParser(GeoJSONParser):
    def __init__(self, filename):
        if filename.endswith(".bz2"):
            self.f = bz2.BZ2File(filename)
        else:
            self.f = open(filename)

    def next_feature(self):
        return json.loads(self.f.next().rstrip())

seems to be bombing with an error report:

dedupe_geojson --use-postal-code --use-zip5 --no-phone-numbers
-o foo --output-filename z1 --name-dupe-threshold 0.0 name.json
Word index file: foo/info_gain.index
Near-dupe tempfile: foo/near_dupes
Features DB: foo/features_db
Output filename: foo/z1
-----------------------------
* Assigning IDs, creating near-dupe hashes + word index (using info_gain)
Traceback (most recent call last):
File "/.local/bin/dedupe_geojson", line 299, in <module>
for feature_id, feature in id_features(args.files):
File "/.local/bin/dedupe_geojson", line 52, in id_features
for feature in f:
TypeError: iter() returned non-iterator of type 'GeoJSONLineParser'


which is easily enough patched/remedied with:

return json.loads(next(self.f).rstrip())

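This looks like a Python 2/Python 3 incompatibility: file objects no longer have a .next() method in Python 3, and the built-in next() is the portable spelling.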
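For what it's worth, the TypeError itself looks like the Python 3 iterator protocol at work: Python 3 calls `__next__`, so a class that only defines the Python 2 style `next` method is rejected by iter(). Here is a minimal sketch of a line-delimited parser that iterates under Python 3. The class name `LineDelimitedGeoJSON` is a hypothetical stand-in, not lieu's actual code, and I'm assuming the real base class only defines `next`.

```python
import bz2
import json


class LineDelimitedGeoJSON:
    """Hypothetical stand-in for GeoJSONLineParser (not lieu's actual code)."""

    def __init__(self, filename):
        # Open in text mode so both branches yield str lines.
        if filename.endswith(".bz2"):
            self.f = bz2.open(filename, "rt")
        else:
            self.f = open(filename)

    def __iter__(self):
        return self

    def __next__(self):
        # The built-in next() replaces the Python 2 only self.f.next().
        # StopIteration raised at end of file also ends this iterator.
        return json.loads(next(self.f).rstrip())

    # Alias so any caller still using the Python 2 spelling keeps working.
    next = __next__
```

With both `__iter__` and `__next__` defined, a plain `for feature in LineDelimitedGeoJSON("name.json"):` loop should work.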


Also, I think there may be a typo ("canoncal" instead of "canonical") at line 99 in https://github.com/openvenues/lieu/blob/master/scripts/dedupe_geojson

def is_name_address_dupe(canoncal, other, dupe_pairs, dupes, word_index=None,
                         name_dupe_threshold=DedupeResponse.default_name_dupe_threshold,
                         needs_review_threshold=DedupeResponse.default_name_review_threshold,
                         with_address=True,
                         with_unit=False,
                         use_phone_number=False,
                         fuzzy_street_names=False):

Before I commit to further hacking to get other slices running (I haven't done anything with the geo features yet, for example), I thought I'd check on a few things:

  1. Is dedupe_geojson expected to be up and running with Python 3.(?)
  2. Could some commit or installation step have failed or gone astray?
  3. Is lieu still something I might expect to work?

Also, I looked around in the installation and didn't see a simple sample input file, which would have saved me a certain amount of effort. As noted, I haven't sorted out all the formats and features, but in the spirit of sharing back, I attach the following JSON as something that seems to sort of work for me in the above call to dedupe_geojson.
name.json.gz
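In case it helps anyone else get started, here is a rough sketch of how a small test file could be generated. The only thing taken from the parser code above is the layout: one JSON feature per line. The venues are made up, and the property names (`name`, `addr:housenumber`, `addr:street`, `addr:postcode`) are my assumption of OSM-style tags, not something I have confirmed lieu requires.

```python
import json

# Made-up venues; property names are assumed OSM-style tags (unconfirmed).
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-73.9857, 40.7484]},
        "properties": {
            "name": "Example Coffee",
            "addr:housenumber": "20",
            "addr:street": "W 34th St",
            "addr:postcode": "10001",
        },
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-73.9856, 40.7485]},
        "properties": {
            "name": "Example Coffee Shop",
            "addr:housenumber": "20",
            "addr:street": "West 34th Street",
            "addr:postcode": "10001",
        },
    },
]


def write_line_delimited(features, path):
    """Write one JSON feature per line, as json.loads(line) expects."""
    with open(path, "w") as out:
        for feature in features:
            out.write(json.dumps(feature) + "\n")
```

Calling `write_line_delimited(features, "sample.json")` produces a file that the line-by-line `json.loads` in the parser can at least read; whether the two entries are then flagged as duplicates depends on lieu itself.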

Thanks for your attention. Good stuff, both this and libpostal. I appreciate your sharing.