Seg fault when writing topic
FirebarSim opened this issue · 4 comments
Approximately 50% of the time, writing an OARIS sample causes a segmentation fault. The Python script hangs for upwards of 60 seconds and then exits, reporting a segmentation fault as follows:
Fatal Python error: Segmentation fault
Current thread 0x00007f63247cd1c0 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/cyclonedds/pub.py", line 189 in write
File "/home/local_admin/Python DDS Test/tx_test.py", line 190 in <module>
Extension modules: cyclonedds._clayer (total: 1)
Segmentation fault (core dumped)
I have also run the same program under gdb, which returns the following:
Starting program: /usr/bin/python3 /home/local_admin/Python\ DDS\ Test/tx_test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
INFO: Cleaning up old OARIS Version
INFO: Instantiated OARIS 3 Wavefront
INFO: Reading configuration data file.
[New Thread 0x7ffff5eff640 (LWP 6522)]
[New Thread 0x7ffff56fe640 (LWP 6523)]
[New Thread 0x7ffff4efd640 (LWP 6524)]
[New Thread 0x7fffeffff640 (LWP 6525)]
[New Thread 0x7fffef7fe640 (LWP 6526)]
INFO: Writing sensor_track_type(additional_information=[], amplitude=sensor_track_amplitude_type[Union](False, None), covariance_matrix=sensor_track_covariance_matrix_type[Union](False, None), environment=sensor_track_environment_type[Union](False, None), initiation_mode=sensor_track_initiation_mode_type[Union](False, None), jammer_indication=False, max_range_limit=sensor_track_max_range_limit_type[Union](False, None), position=position_coordinate_type[Union](<coordinate_kind_type.CARTESIAN: 0>, cartesian_position_type(x_coordinate=0, y_coordinate=0, z_coordinate=cartesian_position_z_coordinate_type[Union](None, None))), position_accuracy=position_accuracy_coordinate_type[Union](<position_accuracy_coordinate_switch_type.position_accuracy_coordinate_type_cartesian_position_accuracy_kind: 0>, cartesian_position_accuracy_type(x_coordinate_accuracy=0, y_coordinate_accuracy=0, z_coordinate_accuracy=cartesian_position_accuracy_z_coordinate_accuracy_type[Union](True, 0))), position_accuracy_coordinate_system=sensor_track_position_accuracy_coordinate_system_type[Union](False, None), position_coordinate_system=coordinate_specification_type(kind=<coordinate_kind_type.CARTESIAN: 0>, orientation=<coordinate_orientation_type.NORTH_EAST_DOWN: 5>, origin=<coordinate_origin_type.PLATFORM_REFERENCE_POINT: 3>), priority=sensor_track_priority_type[Union](False, None), sensor_track_id=1, sensor_track_pre_identification=sensor_track_sensor_track_pre_identification_type[Union](False, None), sensor_track_pre_recognition=sensor_track_sensor_track_pre_recognition_type[Union](False, None), simulated=False, time_of_first_detection=sensor_track_time_of_first_detection_type[Union](False, None), time_of_information=139391397242289552, time_of_initiation=sensor_track_time_of_initiation_type[Union](False, None), time_of_last_detection=sensor_track_time_of_last_detection_type[Union](False, None), track_phase=<track_phase_type.TRACKED: 3>, track_quality=sensor_track_track_quality_type[Union](False, 
None), velocity=velocity_coordinate_type[Union](<coordinate_kind_type.CARTESIAN: 0>, cartesian_velocity_type(x_dot=0, y_dot=0, z_dot=cartesian_velocity_z_dot_type[Union](None, None))), velocity_accuracy=sensor_track_velocity_accuracy_type[Union](False, None), velocity_accuracy_coordinate_system=sensor_track_velocity_accuracy_coordinate_system_type[Union](False, None), velocity_coordinate_system=coordinate_specification_type(kind=<coordinate_kind_type.CARTESIAN: 0>, orientation=<coordinate_orientation_type.NORTH_EAST_DOWN: 5>, origin=<coordinate_origin_type.PLATFORM_REFERENCE_POINT: 3>), activity_id=sensor_track_activity_id_type[Union](False, None), sensor_function_id=[], observed_function_id=sensor_track_observed_function_id_type[Union](False, None), equipment_id=sensor_track_equipment_id_type[Union](False, None), platform_id=sensor_track_platform_id_type[Union](False, None), based_on=[], external_track_number=[], subsystem_id=2)
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
dds_is_get4 (is=0x7fffffffd660) at /home/local_admin/Downloads/cyclonedds-master/src/core/cdr/src/dds_cdrstream.c:310
310 uint32_t v = * ((uint32_t *) (is->m_buffer + is->m_index));
(gdb) bt
#0 dds_is_get4 (is=0x7fffffffd660)
at /home/local_admin/Downloads/cyclonedds-master/src/core/cdr/src/dds_cdrstream.c:310
#1 dds_stream_extract_key_from_data_skip_sequence (ops=0x55555617cd2c, is=0x7fffffffd660)
at /home/local_admin/Downloads/cyclonedds-master/src/core/cdr/src/dds_cdrstream.c:3767
#2 dds_stream_extract_key_from_data_skip_adr (is=0x7fffffffd660, ops=0x55555617cd2c, type=7)
at /home/local_admin/Downloads/cyclonedds-master/src/core/cdr/src/dds_cdrstream.c:3804
#3 0x00007ffff6f588a7 in dds_stream_extract_key_from_data1 (is=0x7fffffffd660, os=0x7fffffffd640,
allocator=0x7ffff7bd8d90 <cdrstream_allocator>, ops=<optimised out>, mutable_member=false,
keys_remaining=0x7fffffffd5f4, n_keys=<optimised out>,
mutable_member_or_parent=<optimised out>, op0=<optimised out>)
at /home/local_admin/Downloads/cyclonedds-master/src/core/cdr/src/dds_cdrstream_keys.part.h:328
#4 0x00007ffff6f60421 in dds_stream_extract_key_from_data (is=is@entry=0x7fffffffd660,
os=os@entry=0x7fffffffd640, allocator=allocator@entry=0x7ffff7bd8d90 <cdrstream_allocator>,
desc=0x555555e32610)
at /home/local_admin/Downloads/cyclonedds-master/src/core/cdr/src/dds_cdrstream_keys.part.h:370
#5 0x00007ffff7bd1c51 in ddspy_serdata_populate_key (this=this@entry=0x555555b6ef60)
at clayer/pysertype.c:80
#6 0x00007ffff7bd2278 in ddspy_serdata_populate_key (this=0x555555b6ef60) at clayer/pysertype.c:359
#7 serdata_from_sample (type=0x555555e32590, kind=SDK_DATA, sample=0x7fffffffd7e0)
at clayer/pysertype.c:349
#8 0x00007ffff6ff4526 in ddsi_serdata_from_sample (sample=0x7fffffffd7e0, kind=SDK_DATA,
type=0x555555e32590)
at /home/local_admin/Downloads/cyclonedds-master/src/security/api/../../core/ddsi/include/dds/ddsi/ddsi_serdata.h:307
#9 dds_write_impl_make_serdata (statusinfo=0, timestamp=1719846924229763893, heap_loan=0x0,
data=0x7fffffffd7e0, sdkind=SDK_DATA, sertype=0x555555e32590)
at /home/local_admin/Downloads/cyclonedds-master/src/core/ddsc/src/dds_write.c:637
#10 dds_write_impl_psmxloan_serdata (serdata=<synthetic pointer>, psmx_loan=<synthetic pointer>,
statusinfo=0, timestamp=1719846924229763893, sdkind=<optimised out>, data=0x7fffffffd7e0,
wr=0x5555560f38f0)
at /home/local_admin/Downloads/cyclonedds-master/src/core/ddsc/src/dds_write.c:736
#11 dds_write_impl (wr=wr@entry=0x5555560f38f0, data=data@entry=0x7fffffffd7e0,
timestamp=1719846924229763893, action=action@entry=DDS_WR_ACTION_WRITE)
at /home/local_admin/Downloads/cyclonedds-master/src/core/ddsc/src/dds_write.c:809
#12 0x00007ffff6ff4a72 in dds_write (writer=<optimised out>, data=data@entry=0x7fffffffd7e0)
at /home/local_admin/Downloads/cyclonedds-master/src/core/ddsc/src/dds_write.c:47
#13 0x00007ffff7bd1a55 in ddspy_write (self=<optimised out>, args=<optimised out>)
at clayer/pysertype.c:954
#14 0x00005555556ae138 in ?? ()
#15 0x00005555556a4a7b in _PyObject_MakeTpCall ()
#16 0x000055555569d096 in _PyEval_EvalFrameDefault ()
#17 0x00005555556ae9fc in _PyFunction_Vectorcall ()
#18 0x000055555569745c in _PyEval_EvalFrameDefault ()
#19 0x00005555556939c6 in ?? ()
#20 0x0000555555789256 in PyEval_EvalCode ()
#21 0x00005555557b4108 in ?? ()
#22 0x00005555557ad9cb in ?? ()
#23 0x00005555557b3e55 in ?? ()
#24 0x00005555557b3338 in _PyRun_SimpleFileObject ()
#25 0x00005555557b2f83 in _PyRun_AnyFileObject ()
#26 0x00005555557a5a5e in Py_RunMain ()
#27 0x000055555577c02d in Py_BytesMain ()
#28 0x00007ffff7c29d90 in __libc_start_call_main (main=main@entry=0x55555577bff0,
argc=argc@entry=2, argv=argv@entry=0x7fffffffe0a8) at ../sysdeps/nptl/libc_start_call_main.h:58
#29 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x55555577bff0, argc=2, argv=0x7fffffffe0a8,
init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7fffffffe098)
at ../csu/libc-start.c:392
#30 0x000055555577bf25 in _start ()
I initially thought that I might be generating the sample incorrectly, but I am somewhat confused by the sample occasionally writing. Any input gratefully received!
The program is really simple, just creating and then writing the above sample. The cyclonedds config file is completely default apart from specifying a NIC. This is the latest (downloaded 1 Jul 24) version of cyclonedds and cyclonedds-python, with the IDL compiled against the same.
After digging a lot further it appears that this issue is probably related to the use of sequences in the OARIS sensor_track_type struct. When the sequences are empty they appear not to deserialise correctly (or maybe serialise) and the bytes are offset, leading to incorrect behaviour.
Interesting ... I went down another path and found something that I think completely explains it. My analysis is that the crash occurs because the Python serializer can produce garbage while the (small amount of) C support code in the Python binding assumes its input is well-formed.
The Python code gets a value from the application and serializes it to CDR, and then the C code extracts the key value from the CDR. There's no user/network input involved between the output of the one and the input of the other, so treating the input as well-formed seems a reasonable assumption.
However ... here we have an incorrect application sample (assuming I transcribed the output you quoted correctly), because
position_accuracy=position_accuracy_coordinate_type(
discriminator=position_accuracy_coordinate_switch_type.position_accuracy_coordinate_type_cartesian_position_accuracy_kind,
value=cartesian_position_accuracy_type(
x_coordinate_accuracy=0,
y_coordinate_accuracy=0,
z_coordinate_accuracy=cartesian_position_accuracy_z_coordinate_accuracy_type(discriminator=True, value=0))),
should've been:
position_accuracy=sensor_track_position_accuracy_type(
discriminator=True,
value=position_accuracy_coordinate_type(
discriminator=position_accuracy_coordinate_switch_type.position_accuracy_coordinate_type_cartesian_position_accuracy_kind,
value=cartesian_position_accuracy_type(
x_coordinate_accuracy=0,
y_coordinate_accuracy=0,
z_coordinate_accuracy=cartesian_position_accuracy_z_coordinate_accuracy_type(discriminator=True, value=0)))),
The Python serializer gets to the position_accuracy field, gets the discriminant from the union, which should have been a boolean but actually is position_accuracy_coordinate_switch_type.position_accuracy_coordinate_type_cartesian_position_accuracy_kind. It can't find that discriminator value in the list of labels of the field's position_accuracy_coordinate_type type because it isn't True, and so it serializes the discriminant and nothing else.
It just so happens that the (little-endian) serialized representation of position_accuracy_coordinate_switch_type.position_accuracy_coordinate_type_cartesian_position_accuracy_kind can be interpreted as the serialized representation of True, and so on extracting the key the C deserializer goes down a different path than the serializer and expects the position accuracy information.
Thus, the two go out of sync. At some point the input sequence makes no sense anymore: it reads a byte that should encode a boolean but gets the number 144. The C verifier stops right there, but the key extraction procedure doesn't, because it assumes well-formed input.
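The writer/reader mismatch described above can be sketched with a toy byte layout. This is only an illustration and not the Cyclone DDS code: the function names, the enum value 1, and the field layout are all made up, and CDR alignment rules are ignored. The point is that a reader which trusts a misread discriminant goes looking for a body that was never written. Python's struct module is bounds-checked and raises an error; an unchecked C read would instead run past the buffer.

```python
import struct

def write_buggy(enum_value: int) -> bytes:
    """Toy writer: emits an enum discriminant (4 bytes, little-endian), finds no
    matching union label, and therefore writes no union body."""
    out = struct.pack("<I", enum_value)  # discriminant only, no body
    out += struct.pack("<B", 0)          # next field: a one-byte flag
    return out

def write_expected(present: bool, x: int, y: int) -> bytes:
    """Toy writer for the layout the reader actually expects."""
    out = struct.pack("<B", present)     # one-byte boolean discriminant
    if present:
        out += struct.pack("<II", x, y)  # union body: two 32-bit values
    out += struct.pack("<B", 0)          # next field: a one-byte flag
    return out

def read(buf: bytes) -> dict:
    """Toy reader: one-byte boolean discriminant, optional 8-byte body, flag."""
    offset = 0
    (present,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    body = None
    if present:
        # Bounds-checked here: raises struct.error if the buffer is too short.
        body = struct.unpack_from("<II", buf, offset)
        offset += 8
    (flag,) = struct.unpack_from("<B", buf, offset)
    return {"present": bool(present), "body": body, "flag": flag}

if __name__ == "__main__":
    print(read(write_expected(True, 0, 0)))  # writer and reader agree
    try:
        read(write_buggy(1))                 # first byte 0x01 is read as "True"
    except struct.error as exc:
        print("reader ran out of input:", exc)
```

In this toy the failure is immediate. In a larger sample the misaligned reader can consume many bytes of unrelated fields before it reaches something nonsensical or the end of the buffer.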
We probably should run the verifier just in case. I don't know Python well enough to even know whether this type confusion problem can be fixed in Python at all.
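On the Python side, one possible guard is to compare each discriminator's type with the type the union declares before anything is serialized. The sketch below uses a stand-in ToyUnion class, not the cyclonedds classes. The expected attribute and the find_mismatches helper are hypothetical, and whether such a check can be wired into the real serializer is not established in this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class ToyKind(Enum):
    CARTESIAN = 0

@dataclass
class ToyUnion:
    discriminator: Any
    value: Any
    expected: type  # hypothetical: the discriminator type this union declares

def find_mismatches(node: Any, path: str = "sample") -> list:
    """Walk nested ToyUnion values and report discriminators of the wrong type."""
    problems = []
    if isinstance(node, ToyUnion):
        # Exact type comparison, so an enum member is never accepted as a bool.
        if type(node.discriminator) is not node.expected:
            problems.append(
                f"{path}: discriminator {node.discriminator!r} "
                f"is not of type {node.expected.__name__}"
            )
        problems += find_mismatches(node.value, path + ".value")
    return problems

if __name__ == "__main__":
    # Inner union placed directly where a bool-discriminated wrapper belongs.
    bad = ToyUnion(ToyKind.CARTESIAN, None, expected=bool)
    # Wrapper with a boolean discriminator around the inner union.
    good = ToyUnion(True, ToyUnion(ToyKind.CARTESIAN, None, expected=ToyKind), expected=bool)
    print(find_mismatches(bad))
    print(find_mismatches(good))
```

The exact type comparison is deliberate: isinstance would let a bool pass wherever an int is expected, because bool is a subclass of int in Python.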
Perhaps you can give #261 a try? I haven't done a lot of testing, but it does detect the malformed CDR for this particular sample and ends up raising a "bad parameter" exception, which I think is correct.
Hah, you've been just slightly quicker at finding that than I have! Just identified the same thing after going text blind staring at the output. I'll have a look at 261 and see if it rejects the bad input.
@eboasson #261 does rectify the issue by raising an error related to a bad parameter. When I was looking yesterday, after a day of bashing my head against why this was only working sometimes, I completely missed the malformed parameter. There must have been a few occasions where the data that was fed in lined up just enough to make things nearly work, which was throwing me off entirely.