Difficulties running dedupe_geojson out of the box and a possibly significant typo?
Hi. This is an interesting package that does almost exactly what I need, but I've had some difficulty getting it to run in Python 3 (3.8, specifically). I spent a day hacking on various bits and pieces and seem to have certain slices running, and I'm happily running libpostal/pypostal in other contexts (great stuff, thanks!). I imagine I have some sort of installation or package dependency issue, but I also wonder if some sort of commit or update may have failed somewhere. For example:
    class GeoJSONLineParser(GeoJSONParser):
        def __init__(self, filename):
            if filename.endswith(".bz2"):
                self.f = bz2.BZ2File(filename)
            else:
                self.f = open(filename)

        def next_feature(self):
            return json.loads(self.f.next().rstrip())
seems to be bombing with an error report:
dedupe_geojson --use-postal-code --use-zip5 --no-phone-numbers
-o foo --output-filename z1 --name-dupe-threshold 0.0 name.json
Word index file: foo/info_gain.index
Near-dupe tempfile: foo/near_dupes
Features DB: foo/features_db
Output filename: foo/z1
-----------------------------
* Assigning IDs, creating near-dupe hashes + word index (using info_gain)
Traceback (most recent call last):
  File "/.local/bin/dedupe_geojson", line 299, in <module>
    for feature_id, feature in id_features(args.files):
  File "/.local/bin/dedupe_geojson", line 52, in id_features
    for feature in f:
TypeError: iter() returned non-iterator of type 'GeoJSONLineParser'
which is easily enough patched with:
    return json.loads(next(self.f).rstrip())
This seems to be some sort of Python 2/Python 3 thing?!
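For what it's worth, in Python 3 this particular TypeError generally means that `__iter__` returned an object with no `__next__` method, so the `next(self.f)` change may be only part of the fix. Below is a minimal sketch of a line parser that iterates under Python 3. I have not checked how the real `GeoJSONParser` base class is written, so the base class here is a hypothetical stand-in, and opening the bz2 file in text mode via `bz2.open` is my own choice rather than what the script does.

```python
import bz2
import json


class GeoJSONParser(object):
    """Hypothetical stand-in for lieu's base class (an assumption, not the real code)."""

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 looks up __next__; Python 2 looked up next()
        return self.next_feature()

    # Keep the Python 2 spelling available as well
    next = __next__


class GeoJSONLineParser(GeoJSONParser):
    def __init__(self, filename):
        if filename.endswith(".bz2"):
            # "rt" yields decoded text lines rather than bytes (Python 3.3+)
            self.f = bz2.open(filename, "rt")
        else:
            self.f = open(filename)

    def next_feature(self):
        # The next() builtin works on file objects, unlike the old .next() method
        return json.loads(next(self.f).rstrip())
```

With something like this in place, `for feature in GeoJSONLineParser("name.json")` should yield one parsed feature per line and stop cleanly at end of file.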
Also, I think there may be a typo ("canoncal" instead of "canonical") at line 99 in https://github.com/openvenues/lieu/blob/master/scripts/dedupe_geojson
    def is_name_address_dupe(canoncal, other, dupe_pairs, dupes, word_index=None,
                             name_dupe_threshold=DedupeResponse.default_name_dupe_threshold,
                             needs_review_threshold=DedupeResponse.default_name_review_threshold,
                             with_address=True,
                             with_unit=False,
                             use_phone_number=False,
                             fuzzy_street_names=False):
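Whether the spelling matters depends on how the function is called and on which name the body uses, neither of which I have traced through. The sketch below uses a hypothetical function `is_dupe` (not lieu's code) to show the two cases: a positional call is unaffected by the misspelling, while a keyword call using the correctly spelled name raises a TypeError.

```python
def is_dupe(canoncal, other):
    # Hypothetical function; the body consistently uses the misspelled name
    return canoncal == other


# Positional call: the parameter's spelling is irrelevant
positional_result = is_dupe("a", "a")

# Keyword call with the correctly spelled name is rejected
try:
    is_dupe(canonical="a", other="a")
    keyword_error = None
except TypeError as exc:
    keyword_error = str(exc)
```

If the body instead referred to `canonical` while the parameter stayed `canoncal`, the function would fail with a NameError the first time that line ran.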
Before I commit to further hacking to get other slices running (I haven't done anything with the geo features yet, for example), I thought I'd check on some combination of the following:
- Is dedupe_geojson expected to be up and running with Python 3, and if so, with which 3.x versions?
- Has some commit or installation step somehow failed or gone astray?
- Is lieu still something I might expect to work?
Also, I looked around in the installation and didn't see a simple sample input file, which would have saved me a certain amount of effort. As noted, I haven't sorted out all the formats and features, but in the spirit of sharing back, I attach the following JSON file, which seems to more or less work for me in the above call to dedupe_geojson.
name.json.gz
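For anyone else looking for a starting point, here is a minimal sketch of how a small test file could be generated: one GeoJSON feature per line, which is the layout `next_feature` above reads. The property keys and values are made-up placeholders and not necessarily the ones lieu expects, and `make_feature` and `write_features` are hypothetical helpers of my own.

```python
import json


def make_feature(name, lon, lat):
    # GeoJSON point coordinates are ordered [longitude, latitude]
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        # Placeholder properties; the keys lieu actually uses may differ
        "properties": {"name": name},
    }


def write_features(path, features):
    # Write one JSON object per line (line-delimited GeoJSON)
    with open(path, "w") as out:
        for feature in features:
            out.write(json.dumps(feature) + "\n")


# Two made-up, near-identical venues to exercise name deduplication
features = [
    make_feature("Example Cafe", -73.99, 40.73),
    make_feature("Example Cafe Inc", -73.99, 40.73),
]
```

Calling `write_features("name.json", features)` would then produce a two-line file that can be passed to dedupe_geojson in place of my attachment.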
Thanks for your attention. Good stuff, both this and libpostal. I appreciate your sharing.