ad-freiburg/osm2rdf

Segfault during DAG build persists

Closed this issue · 6 comments

Unfortunately, the latest weekly run crashed this morning with a segmentation fault during the DAG build for the USA dataset. This seems to be the same bug we encountered on the planet.osm dataset in autumn, which we thought had been fixed. I am currently trying to reproduce it in gdb.

The segfault occurred again this weekend, and again, multiple attempts since yesterday to reproduce this with gdb/valgrind/thread-sanitizer failed. I now manually went through the code line by line and found one line where an access to the first box ID of a geometry was not protected by a check whether the geometry has any box IDs at all. Normally, each geometry should be assigned at least one box ID, as the grid covers the entire globe. However, it might be that in extreme edge cases (for example, a polygon lying completely on an edge of the grid) a geometry may not have any box ID at all. This case is now caught with 755e5af.
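The kind of guard described above could look roughly like the following. This is a minimal sketch and not the code from 755e5af: `BoxId`, `GeomEntry`, and `firstBoxId` are hypothetical names chosen for illustration, and the real osm2rdf data structures differ.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical types for illustration only; not the osm2rdf structures.
using BoxId = std::int32_t;

struct GeomEntry {
  std::vector<BoxId> boxIds;  // grid cells the geometry was assigned to
};

// Returns the first box ID, or std::nullopt if the geometry has none.
// Calling front() (or operator[]) on an empty vector is undefined behavior,
// so the emptiness check has to come first.
std::optional<BoxId> firstBoxId(const GeomEntry& g) {
  if (g.boxIds.empty()) {
    return std::nullopt;
  }
  return g.boxIds.front();
}
```

A caller would then skip or specially handle any geometry for which `firstBoxId` returns an empty optional, instead of dereferencing unconditionally.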

I now restarted the build with 755e5af, let's see what happens.

Update: this wasn't the cause.

Additional runs of 755e5af today on our ob machine (✓ = success without any error/warning, ✗ = segfault):

  • usa.osm, in gdb, with -g: ✓
  • usa.osm, in gdb, without -g: ✓
  • usa.osm, with thread-sanitizer: ✓
  • usa.osm, release build: ✗

I will create an artificial USA dataset containing only named areas tomorrow and test it with valgrind.

For now, our weekly update runs entirely within gdb.

Still no luck reproducing this, even with synthetic datasets, and with the removal of all sanity checks we have in prepareDAG(). So far, I wasn't even able to reproduce it with master, on the same machine (ob), with the same command line and build parameters as for the weekly build, but outside of a Docker environment. I now bumped the Ubuntu version in the Docker container to 22.04, let's see how this behaves.

I was now finally able to reproduce this in a single-threaded environment (so no data races) on the ABB (Asia) dataset with basic debug output.

The crash occurred during a check between these areas:

https://www.openstreetmap.org/relation/5615030
https://www.openstreetmap.org/relation/1949881

The latter crosses the dateline and thus the boundaries of the Mercator projection, so there is certainly something special about it. How that relates to the crash remains to be investigated :)
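To find other candidate areas of this kind, one could flag rings that cross the antimeridian. The following is only a heuristic sketch and not taken from osm2rdf: `LonLat` and `crossesAntimeridian` are hypothetical names, coordinates are assumed to be WGS84 degrees, and a segment that legitimately spans more than 180 degrees of longitude would be a false positive.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical point type: longitude/latitude in degrees.
struct LonLat {
  double lon;
  double lat;
};

// Heuristic: assume a ring crosses the antimeridian if two consecutive
// vertices differ by more than 180 degrees in longitude.
bool crossesAntimeridian(const std::vector<LonLat>& ring) {
  for (std::size_t i = 1; i < ring.size(); ++i) {
    if (std::fabs(ring[i].lon - ring[i - 1].lon) > 180.0) {
      return true;
    }
  }
  return false;
}
```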

Current state. All runs use a single-threaded prepareDAG() with a deterministic processing order of areas.

Note that from Boost 1.78 upwards, GEOMETRYCOLLECTIONS() are supported and enabled.
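For reading the results, it helps to know which Boost versions fall on which side of that threshold. A minimal sketch of such a version gate, assuming the 1.78 cutoff stated above; `geometryCollectionsEnabled` and `kMinBoostForGeometryCollections` are hypothetical names and not identifiers from osm2rdf:

```cpp
// BOOST_VERSION (from <boost/version.hpp>) encodes the version as
// major * 100000 + minor * 100 + patch, so Boost 1.78.0 is 107800.
constexpr int kMinBoostForGeometryCollections = 107800;

// True if the given BOOST_VERSION-style number is at or above the cutoff.
constexpr bool geometryCollectionsEnabled(int boostVersion) {
  return boostVersion >= kMinBoostForGeometryCollections;
}

static_assert(!geometryCollectionsEnabled(107400), "Boost 1.74: disabled");
static_assert(geometryCollectionsEnabled(107800), "Boost 1.78: enabled");
static_assert(geometryCollectionsEnabled(108400), "Boost 1.84: enabled");
```

In a real build, one would include `<boost/version.hpp>` and pass `BOOST_VERSION` to the function, or use the same comparison in an `#if` directive.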

| Machine | Env | Mode | Dataset | Crash? |
|---|---|---|---|---|
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -O3 | abb.osm.pbf | Yes † |
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -g, gdb | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.74) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.79) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.80) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.84) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.84) | -O3 | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.79) | -O3 | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -g -fsanitize=address | abb.osm.pbf | Yes †† |
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -g, gdb | abb.osm.pbf | TODO |
| ob | Docker (Ubuntu 22.04, Boost 1.74) | -g, gdb | abb.osm.pbf | TODO |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -O3 | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | abb.osm.pbf | TODO |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | abb.osm.pbf | TODO, will take very long (weeks) |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | 5615030-1949881.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | geo.osm.pbf (box around Antimeridian) | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | 5615030-1949881.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | geo.osm.pbf (box around Antimeridian) | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | 5615030-1949881.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | geo.osm.pbf (box around Antimeridian) | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.78) | -g, valgrind | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | geo.osm.pbf (box around Antimeridian) | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | geo.osm.pbf (box around Antimeridian) | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | geo.osm.pbf (box around Antimeridian) | No |

† Segfault after checking the areas 5615030, 1949881 as described above, possibly during stack cleanup. Exactly reproduced every time (tested 4x)
†† AddressSanitizer failed to allocate 0x1f000 (126976) bytes at address fe31ea9b000 (errno: 12), during dataset loading via libosmium, took 4 days

The latest build (now with Boost 1.84) ran through without problems. TL;DR: with Boost 1.78, the code segfaulted every time during the comparison of areas 5615030 and 1949881 in a single-threaded environment, but only if nobody looked (it ran through fine with gdb, valgrind, and with the thread sanitizer enabled). Later and earlier Boost versions worked fine.

Closing this now, although I am still not 100% convinced that the cause is not in our code.