ADLINK-IST/opensplice

Debugging erratic DDS behaviour

Opened this issue · 6 comments

Until very recently, the system I was maintaining was running OSPL 6.7180404OSS. Part of this has been rebuilt recently against 6.9.210323OSS and I have begun to see some extremely erratic behaviour that almost certainly borders on broken, however I am clueless as to how to debug such a scenario.

Essentially there is a few C++ apps, one of which is publishing to a topic. There are other Java apps that use the supplied JNI layer that subscribe to this topic. When problems arose, I turned to a quick tool that was written to show all publishers and subscribers for a given topic and I observed the following:

  • After the publisher is started, I see 1 publisher;
  • After the first subscriber is started, I see 1 publisher and 1 subscriber;
  • After the second subscriber is started, I see 1 publisher and 1 subscriber for a while, then I see 2 subscribers and 0 publishers;
  • After the third subscriber is started, I see anything from 1-3 subscribers with 0 publishers or 1 publisher with 0 subscribers.

This feels like the whole messaging layer is malformed somehow, however there are no ospl error logs. The only things of note in the info logs are missed heartbeats, which seems to occur when new publishers/subscribers are started.

I am looking to see if anyone would put a finger on any particular area to investigate if anyone had happened to see such behaviour before. At the moment given the limited tools and knowledge at my disposal, it may have to abort back to 6.7 unless I can ascertain what the actual issue may be.

Dear Richard,

May i know your ospl configuraion? Please share us your previous and latest ospl configuration.

With best regards,
Vivek Pandey
ADLINK Technology

The configuration is as follows (which has not changed between 6.7 and 6.9):

<OpenSplice>
    <Domain>
        <Name>rdghickman</Name>
        <Id>224</Id>
        <SingleProcess>true</SingleProcess>
        <Description>Stand-alone 'single-process' deployment and standard DDSI networking.</Description>
        <Service name="ddsi2">
            <Command>ddsi2</Command>
        </Service>
        <Service enabled="false" name="cmsoap">
            <Command>cmsoap</Command>
        </Service>
        <Report verbosity="INFO"/>
    </Domain>
    <DDSI2Service name="ddsi2">
        <General>
            <NetworkInterfaceAddress>eth0</NetworkInterfaceAddress>
            <AllowMulticast>true</AllowMulticast>
            <EnableMulticastLoopback>true</EnableMulticastLoopback>
            <CoexistWithNativeNetworking>false</CoexistWithNativeNetworking>
        </General>
        <Compatibility>
            <!-- see the release notes and/or the OpenSplice configurator on DDSI interoperability -->
            <StandardsConformance>lax</StandardsConformance>
            <!-- the following one is necessary only for TwinOaks CoreDX DDS compatibility -->
            <!-- <ExplicitlyPublishQosSetToDefault>true</ExplicitlyPublishQosSetToDefault> -->
        </Compatibility>
        <Discovery>
            <MaxAutoParticipantIndex>39</MaxAutoParticipantIndex>
        </Discovery>
    </DDSI2Service>
    <TunerService name="cmsoap">
        <Server>
            <PortNr>Auto</PortNr>
        </Server>
    </TunerService>
</OpenSplice>

This runs on x64 Linux built with GCC 10.

After further analysis our current belief is that this only affects Java applications. C++ applications seem fine which may indicate an issue with the 6.9 JNI layer or its configuration, but it is difficult to be certain.

It seems wired to me why it is not working because we didn't find any such problem in the OSPL 6.9.210323OSS.

You could try with our Vortex Opensplice DDS Commercial 6.11.x ( https://www.adlinktech.com/en/vortex-data-distribution-service-dds-software-evaluation ).

With best regards,
Vivek

Have now confirmed that reverting to 6.7.180404OSS with the same code appears to fix this issue, even with GCC10, so we will have to abandon 6.9.210323OSS at this time and look for another solution.

After extensive analysis I believe I have actually found the problem. I don't believe it is actually a 6.7 vs 6.9 issue, but it is a subtle mis-use of DDS from the Java side which seems to create a catastrophic problem inside DDS. Unfortunately OpenSplice does not show up with any warnings or errors that I can find, which resulted in our assumption that the upgrade of OpenSplice caused the failure. While it appeared to occur in 6.9 and not 6.7, what is required is a Java application using LivelinessChangedStatus, which crept in during the shift to 6.9. Allow me to clarify.

We have a wrapper around the very old Java 5 DDS API, one of which is a reader-type class which supports a liveliness callback:

public LivelinessOpenspliceDdsReader(Subscriber subscriber, Topic<T> topic, DdsReceiver<R> receiver, MessageConverter<T, R> converter,
        DdsLivelinessListener livelinessListener, QosPolicy.ForDataReader... policies)
    {
        super(subscriber, topic, receiver, converter, Arrays.asList(DataAvailableStatus.class, LivelinessChangedStatus.class), policies);
        this.livelinessListener = Objects.requireNonNull(livelinessListener);
    }

    @Override
    public void onLivelinessChanged(LivelinessChangedEvent<T> status)
    {
        super.onLivelinessChanged(status);
        livelinessListener.livelinessChanged(status.getStatus().getAliveCount());
    }

This code appears innocent enough and is completely valid Java however this causes what seems to be a massive problem. The base class does the following:

protected BaseOpenspliceDdsReader(Subscriber subscriber, Topic<T> topic, DdsReceiver<R> receiver, MessageConverter<T, R> converter,
        Collection<Class<? extends Status>> status, QosPolicy.ForDataReader... policies)
    {
        this.subscriber = Objects.requireNonNull(subscriber);
        this.converter = Objects.requireNonNull(converter);
        this.receiver = Objects.requireNonNull(receiver);
        this.dataReader = subscriber.createDataReader(topic,
            subscriber.copyFromTopicQos(subscriber.getDefaultDataReaderQos().withPolicies(policies),
                topic.getQos()), this, status);
    }

It appears that createDataReader ends up calling down into native code, with the status variable in this case including LivelinessChangedStatus.class. This completes the constructor for the base class but not the constructor for the child class. The native code appears to then callback into the Java side, hitting the void onLivelinessChanged(LivelinessChangedEvent<T> status) method in the child class, however since the child class has not finished its constructor, the livelinessListener member variable has not yet been initialised.

It appears this is causing some significant problem inside the DDS layer at some point, however there are no warnings or errors. If the LivelinessChangedStatus.class is removed then no problems are seen because presumably the void onLivelinessChanged(LivelinessChangedEvent<T> status) is not being accessed and generating what is most likely a NullPointerException.

Given that the Java written is valid and there does not appear to be anything stopping JNI from calling into an object's methods before initialisation is complete, may I perhaps suggest the following:

  • Make it extremely clear in some appropriate documentation that the methods of a DataReaderListener<TYPE> may be called, even during initialisation;
  • Add at least some manner of warning or error to the OSPL log files if this kind of callback failure occurs to aid in tracking down the problem.

Dear Richard,

Thank you for your comment. I really appreciate your suggestion. We will test the case and consider your workaround, if required.

With best regards,
Vivek K Pandey
Solutions Architect
ADLINK Technology, SAT