HTenkanen/pyrosm

custom_filter get_pois not working with value True

mattijsdp opened this issue · 6 comments

Describe the bug
The custom_filter does not work as one would expect when using the catch-all 'True' as a value in combination with another tag key. If I understand the documentation correctly, then this should return all osm records with that tag key and this return more entities compared to when one specifies a tag key and value.

To Reproduce

from pyrosm import OSM, get_data
import shapely
import h3

# Some h3 cell in New York
bb = shapely.Polygon(h3.h3_to_geo_boundary("882a107749fffff", geo_json=True))

state = "New York"
fp = get_data("New York")
osm = OSM(fp, bounding_box=bb)

# Get all buildings and "railway": "station"
pois = osm.get_pois(custom_filter={"building": True, "railway": ["station"]})
print(len(pois)) # 4

# Get all "building": "yes" and "railway": "station"
pois = osm.get_pois(custom_filter={"building": ["yes"], "railway": ["station"]})
print(len(pois)) # 2152

Expected behavior
Using True should always lead to a geq number of records than specifying some tag value.

Environment:

  • OS: Linux
  • Python package source (PyPi, conda, ...): pyrosm, 0.6.2, py311hb755f60_1, conda-forge
  • Versions of Python, Java Development Kit, Python modules: 3.11.6

Additional context
I realise there is a function called get_buildings() which would be more suited if you are only interested in buildings. I am interested in POIs but some buildings I would also like to include. In general, I am using a somewhat involved custom_filter with True and this example shows that I either don't understand the implementation or there is a bug.

Great work on the package though, very useful!!

I believe it is because in the line below, the tag values of True aren't added to filter_keys nor data_filter. If all tag values are True then this is fine as data_filter is None is handled below that but not when it is a combination.
https://github.com/HTenkanen/pyrosm/blob/66de74bd0496d1148618842cac58923bf22d97ea/pyrosm/data_filter.pyx#L112C13-L112C13

I see that overlapping_filter also plays a part. I think this should be set to True if there are any tags that are set to value True (catch-all) so that the for loop here is never broken early in these cases.

I'm getting odd results with using the custom_filter for the get_boundaries method too.

Filtering with an osm id, yields more results than the same request without a custom filter.

my_filter = {'id':['639343']}
admin_boundaries = osm_data_obj.get_boundaries(custom_filter = my_filter)

yields 378 results that do not have the id '639343'. Without a custom filter there are only 245 results.

Similar things are true with filters:
{'name':['Regierungsbezirk Karlsruhe']} yields 245 results, includes only 1 with that exact name
{'admin_level':['6']} yields 245 results as well. includes only 10 with that admin_level

using a custom filter for the get_data_by_custom_criteria method seems to work as expected however.

@Chwiggy could you share the bounding box or area you're using?

This is the geometry i was using:

{ "type": "Feature",
    "properties": { "name": "2023_rnv_gtfs.zip", "path": "filepath" },
    "geometry": { 
         "type": "Polygon",
         "coordinates": [ [ [ 8.1704019, 49.3432631 ], [ 8.811395, 49.3432631 ], [ 8.811395, 49.5811911 ], [ 8.1704019, 49.5811911 ], [ 8.1704019, 49.3432631 ] ] ]
    } 
}

the osm.pbf file was cropped to this bounding box from a larger file with osmosis with completeWays=yes

@Chwiggy please see my MR for a fix.

Your use case works as expected though. It is just because get_boundaries adds additional tag key:value combinations as can be seen here. If the boundary tag key isn't specified then it is added as True thereby including all osm entities with the tag boundary