apache/celix

pubsub_zmq aborts when running within a container

rlenferink opened this issue · 2 comments

The pubsub_zmq tests fail (SEGV) when run inside a container. The cause is that the user inside the container may be root (uid = 0), which makes this check succeed:

//NOTE. ZMQ will abort when performing a sched_setscheduler without permission.
//As result permission has to be checked first.
//TODO update this to use cap_get_pid and cap-get_flag instead of check user is root (note adds dep to -lcap)
bool gotPermission = false;
if (getuid() == 0) {
    gotPermission = true;
}

gotPermission is later used to decide whether the scheduling priority is set:

zmq_ctx_set(receiver->zmqCtx, ZMQ_THREAD_PRIORITY, (int) prio);

When this is called as root inside a container (uid 0) while the user outside the container is rootless, the tests segfault because the pthread_setschedparam call fails.

This is the line where libzmq eventually crashes:

https://github.com/zeromq/libzmq/blob/4097855ddaaa65ed7b5e8cb86d143842a594eebd/src/thread.cpp#L345

libzmq does not handle this failure gracefully, and I am not sure whether this can be solved.

I tried the suggested libcap approach and then simply fell back to the capsh command, but there cap_sys_nice is still reported as available:

root@fedora:/home/rlenferink/workspace/asf/celix/celix-container# capsh --print
Current: =ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
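
Since the capability sets above still list cap_sys_nice, one option might be to probe the scheduler call itself instead of inspecting the uid or the capabilities. The following is only a minimal sketch under assumptions: the helper name can_set_realtime_priority is hypothetical, SCHED_FIFO is an assumed policy, and the probe runs in a forked child so the scheduling of the calling process is left untouched. It has not been verified against the container setup described here.

```c
#include <errno.h>
#include <sched.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of Celix or libzmq).
 * Returns true only if the kernel actually accepts a realtime
 * scheduling request for a process running as the current user. */
static bool can_set_realtime_priority(void) {
    /* Assumption: SCHED_FIFO at its lowest priority is a representative probe. */
    struct sched_param param;
    param.sched_priority = sched_get_priority_min(SCHED_FIFO);

    pid_t pid = fork();
    if (pid < 0) {
        return false; /* cannot probe, so assume no permission */
    }
    if (pid == 0) {
        /* Child: try to switch itself to the realtime policy and
         * report the outcome through the exit code. */
        _exit(sched_setscheduler(0, SCHED_FIFO, &param) == 0 ? 0 : 1);
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    int status = 0;
    pid_t waited;
    do {
        waited = waitpid(pid, &status, 0);
    } while (waited < 0 && errno == EINTR);

    return waited == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With a helper like this, gotPermission could be assigned from can_set_realtime_priority() rather than from getuid() == 0, so that a failing scheduler call is detected before libzmq is asked to change the thread priority.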

Any suggestions to solve this?

I would like to drop support for PubSub bundles for Apache Celix 3.0.0 and if we do that, IMO this does not need to be solved.

If we would like to keep the PubSub bundles, I think the best solution is to set ZMQ_THREAD_PRIORITY or ZMQ_THREAD_SCHED_POLICY only if this is explicitly enabled through a config property.
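
A rough sketch of such an opt-in guard is shown below. The function names and the accepted values "true" and "1" are hypothetical and only illustrate the idea; how the property value is looked up (for example from the bundle context or topic properties) is left out on purpose.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: interprets a config property value as a boolean.
 * A missing property (NULL) counts as disabled. */
static bool example_isEnabled(const char *value) {
    return value != NULL &&
           (strcmp(value, "true") == 0 || strcmp(value, "1") == 0);
}

/* Hypothetical guard: the thread priority is only set when the user
 * explicitly opted in AND the permission check succeeded. */
static bool example_shouldSetThreadPriority(const char *propertyValue,
                                            bool gotPermission) {
    return gotPermission && example_isEnabled(propertyValue);
}
```

The existing zmq_ctx_set(receiver->zmqCtx, ZMQ_THREAD_PRIORITY, (int) prio) call would then be wrapped in an if on example_shouldSetThreadPriority(...), so a default configuration never touches the scheduler.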

According to the Docker documentation, the host machine's kernel must be configured properly (CONFIG_RT_GROUP_SCHED): https://docs.docker.com/config/containers/resource_constraints/#configure-the-realtime-scheduler
My local Ubuntu does not support this.

PubSub already provides configuration options for this. It seems to me to be purely a test configuration issue: an additional CMake option such as RUN_IN_CONTAINER (and a corresponding Conan option) should be enough to make these tests use another set of *.properties.