Fast DDS 2.14.0: shared memory transport cannot communicate after the process is restarted
Is there an already existing issue for this?
- I have searched the existing issues
Expected behavior
After the process crashes and is restarted, the subscriber should again receive data on the subscribed topic.
Current behavior
After the process crashes and is restarted, the topic subscription is reported as successful, but no data is received.
Steps to reproduce
Run one publisher and one subscriber over the shared memory transport. Click inside the subscriber's console window so that the subscriber hangs, then close the console and restart the subscriber. The topic subscription is reported as successful, but no data is received. The problem does not occur when using UDP.
Fast DDS version/commit
2.14.0 WINDOWS binary installation package downloaded from the official website
Platform/Architecture
Windows 10 Visual Studio 2019
Transport layer
Shared Memory Transport (SHM)
Additional context
FASTDDS 2.14.0
XML configuration file
<?xml version="1.0" encoding="UTF-8" ?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<participant profile_name="HydroTechSurvey">
<!-- <domainId>4</domainId> -->
<rtps>
<name>HydroTechSurvey</name>
<propertiesPolicy>
<properties>
<!-- Activate Fast DDS Statistics Module -->
<property>
<name>fastdds.statistics</name>
<value>HISTORY_LATENCY_TOPIC;NETWORK_LATENCY_TOPIC;PUBLICATION_THROUGHPUT_TOPIC;SUBSCRIPTION_THROUGHPUT_TOPIC;RTPS_SENT_TOPIC;RTPS_LOST_TOPIC;HEARTBEAT_COUNT_TOPIC;ACKNACK_COUNT_TOPIC;NACKFRAG_COUNT_TOPIC;GAP_COUNT_TOPIC;DATA_COUNT_TOPIC;RESENT_DATAS_TOPIC;SAMPLE_DATAS_TOPIC;PDP_PACKETS_TOPIC;EDP_PACKETS_TOPIC;DISCOVERY_TOPIC;PHYSICAL_DATA_TOPIC</value>
</property>
</properties>
</propertiesPolicy>
</rtps>
</participant>
<data_writer profile_name="datawriter">
<topic>
<historyQos>
<kind>KEEP_LAST</kind>
<depth>1</depth>
</historyQos>
<resourceLimitsQos>
<max_samples>1</max_samples>
<max_instances>1</max_instances>
<max_samples_per_instance>1</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
<extra_samples>10</extra_samples>
</resourceLimitsQos>
</topic>
<qos>
<reliability>
<kind>RELIABLE</kind>
<max_blocking_time>
<sec>3</sec>
</max_blocking_time>
</reliability>
</qos>
<times> <!-- writerTimesType -->
<initialHeartbeatDelay>
<nanosec>12</nanosec>
</initialHeartbeatDelay>
<heartbeatPeriod>
<sec>3</sec>
</heartbeatPeriod>
<nackResponseDelay>
<nanosec>5</nanosec>
</nackResponseDelay>
<nackSupressionDuration>
<sec>0</sec>
</nackSupressionDuration>
</times>
<historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>
<matchedSubscribersAllocation>
<initial>10</initial>
<maximum>20</maximum>
<increment>1</increment>
</matchedSubscribersAllocation>
</data_writer>
<data_reader profile_name="datareader">
<topic>
<historyQos>
<kind>KEEP_LAST</kind>
<depth>1</depth>
</historyQos>
<resourceLimitsQos>
<max_samples>1</max_samples>
<max_instances>1</max_instances>
<max_samples_per_instance>1</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
<extra_samples>10</extra_samples>
</resourceLimitsQos>
</topic>
<qos>
<reliability>
<kind>RELIABLE</kind>
<max_blocking_time>
<sec>15</sec>
</max_blocking_time>
</reliability>
</qos>
<times> <!-- readerTimesType -->
<initialAcknackDelay>
<nanosec>70</nanosec>
</initialAcknackDelay>
<heartbeatResponseDelay>
<nanosec>5</nanosec>
</heartbeatResponseDelay>
</times>
<expectsInlineQos>true</expectsInlineQos>
<historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>
<matchedPublishersAllocation>
<initial>10</initial>
<maximum>20</maximum>
<increment>1</increment>
</matchedPublishersAllocation>
</data_reader>
<topic profile_name="topic">
<historyQos>
<kind>KEEP_LAST</kind>
<depth>1</depth>
</historyQos>
<resourceLimitsQos>
<max_samples>1</max_samples>
<max_instances>1</max_instances>
<max_samples_per_instance>1</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
<extra_samples>10</extra_samples>
</resourceLimitsQos>
</topic>
</profiles>
</dds>
<?xml version="1.0" encoding="UTF-8" ?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<participant profile_name="hydro_mbgeo_process">
<!-- <domainId>4</domainId> -->
<rtps>
<name>hydro_mbgeo_process</name>
<propertiesPolicy>
<properties>
<!-- Activate Fast DDS Statistics Module -->
<property>
<name>fastdds.statistics</name>
<value>HISTORY_LATENCY_TOPIC;NETWORK_LATENCY_TOPIC;PUBLICATION_THROUGHPUT_TOPIC;SUBSCRIPTION_THROUGHPUT_TOPIC;RTPS_SENT_TOPIC;RTPS_LOST_TOPIC;HEARTBEAT_COUNT_TOPIC;ACKNACK_COUNT_TOPIC;NACKFRAG_COUNT_TOPIC;GAP_COUNT_TOPIC;DATA_COUNT_TOPIC;RESENT_DATAS_TOPIC;SAMPLE_DATAS_TOPIC;PDP_PACKETS_TOPIC;EDP_PACKETS_TOPIC;DISCOVERY_TOPIC;PHYSICAL_DATA_TOPIC</value>
</property>
</properties>
</propertiesPolicy>
</rtps>
</participant>
<data_writer profile_name="datawriter">
<topic>
<historyQos>
<kind>KEEP_LAST</kind>
<depth>1</depth>
</historyQos>
<resourceLimitsQos>
<max_samples>1</max_samples>
<max_instances>1</max_instances>
<max_samples_per_instance>1</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
<extra_samples>10</extra_samples>
</resourceLimitsQos>
</topic>
<qos>
<reliability>
<kind>RELIABLE</kind>
<max_blocking_time>
<sec>5</sec>
</max_blocking_time>
</reliability>
</qos>
<times> <!-- writerTimesType -->
<initialHeartbeatDelay>
<nanosec>12</nanosec>
</initialHeartbeatDelay>
<heartbeatPeriod>
<sec>3</sec>
</heartbeatPeriod>
<nackResponseDelay>
<nanosec>5</nanosec>
</nackResponseDelay>
<nackSupressionDuration>
<sec>0</sec>
</nackSupressionDuration>
</times>
<historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>
<matchedSubscribersAllocation>
<initial>10</initial>
<maximum>20</maximum>
<increment>1</increment>
</matchedSubscribersAllocation>
</data_writer>
<data_reader profile_name="datareader">
<topic>
<historyQos>
<kind>KEEP_LAST</kind>
<depth>1</depth>
</historyQos>
<resourceLimitsQos>
<max_samples>1</max_samples>
<max_instances>1</max_instances>
<max_samples_per_instance>1</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
<extra_samples>10</extra_samples>
</resourceLimitsQos>
</topic>
<qos>
<reliability>
<kind>RELIABLE</kind>
<max_blocking_time>
<sec>5</sec>
</max_blocking_time>
</reliability>
</qos>
<times> <!-- readerTimesType -->
<initialAcknackDelay>
<nanosec>70</nanosec>
</initialAcknackDelay>
<heartbeatResponseDelay>
<nanosec>5</nanosec>
</heartbeatResponseDelay>
</times>
<expectsInlineQos>true</expectsInlineQos>
<historyMemoryPolicy>DYNAMIC_REUSABLE</historyMemoryPolicy>
<matchedPublishersAllocation>
<initial>10</initial>
<maximum>20</maximum>
<increment>1</increment>
</matchedPublishersAllocation>
</data_reader>
<topic profile_name="topic">
<historyQos>
<kind>KEEP_LAST</kind>
<depth>1</depth>
</historyQos>
<resourceLimitsQos>
<max_samples>1</max_samples>
<max_instances>1</max_instances>
<max_samples_per_instance>1</max_samples_per_instance>
<allocated_samples>0</allocated_samples>
<extra_samples>10</extra_samples>
</resourceLimitsQos>
</topic>
</profiles>
</dds>
Relevant log output
Output without FASTDDS enabled
Network traffic capture
No response
Hi @zhangzhen5729,
thanks for using Fast DDS.
To keep the application from crashing, you can handle the signal raised when the terminal is closed:
#include <csignal>
#include <functional>

std::function<void(int)> stop_app_handler;

void signal_handler(
        int signum)
{
    stop_app_handler(signum);
}

// In the application
signal(SIGTERM, signal_handler);
You would also need to clean the folder containing shared memory files because it's probably full.
Please, let us know if the problem is solved.
@elianalf If the program has no console and runs in the background, how can we solve the problem of releasing resources after an unexpected crash? This situation can also lead to successful topic subscription, but no data can be received.
Hello, I am also experiencing the problem.
@elianalf, why do we need to call the fastdds shm clean command separately or manually?
Couldn't this be done when the DDS object is released?
> If the program has no console and runs in the background, how can we solve the problem of releasing resources after an unexpected crash?
There are many signals to handle all kinds of situations, even if the program has no console.
> why do we need to call the fastdds shm clean command separately or manually?
> Couldn't this be done when the DDS object is released?
Fast DDS already does the cleanup and releases the resources when the application is correctly closed. When the application does not correctly close, and the error signal is not handled, the internal cleanup is not called and a manual cleanup is necessary.
If the program crashes unexpectedly and no signal is captured, and the FASTDDS resources are not released, resulting in the inability to receive data after restart, what should be done? Should FASTDDS continue to enhance the fault tolerance of unexpected crashes of SHM communication participants?
@elianalf Hello, under Windows, if the program crashes unexpectedly, how can you capture the process crash or exit signal?
It is good to know that FastDDS stores files in C:\ProgramData\eprosima\fastrtps_interprocess. It took me a long time to find out why FastDDS suddenly stopped working 👎. The reason was files in this directory that had not been deleted because a programme had crashed. I have developed a very simple hotfix for Windows, which hopefully solves the problem permanently. It does not work for static libraries unless you call the corresponding function yourself. I have added a file dllmain.cpp to the library:
#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <filesystem>
#include <string>

// Returns true if the file cannot be opened with exclusive access,
// which is taken to mean that another process still has it open.
static bool win32_test_file_open(std::filesystem::path file)
{
    HANDLE hFile = CreateFileA(file.string().c_str(), GENERIC_READ | GENERIC_WRITE, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return true; // File is open
    else
    {
        // File not open
        CloseHandle(hFile);
        return false;
    }
}

// Removes shared memory files, together with their "_el" and "_mutex"
// companions, that no running process holds open.
static void cleanup_eprosima()
{
    std::filesystem::path interprocessdata("C:\\ProgramData\\eprosima\\fastrtps_interprocess");
    if (!std::filesystem::exists(interprocessdata)) return;
    for (const std::filesystem::directory_entry& entry : std::filesystem::directory_iterator(interprocessdata))
    {
        if (!entry.exists()) continue;
        std::filesystem::path file = entry.path();
        // Companion files are handled together with their main file (ends_with requires C++20)
        if (file.filename().string().ends_with("_el") || file.filename().string().ends_with("_mutex")) continue;
        if (win32_test_file_open(file)) continue;
        std::filesystem::path el(file.string() + "_el");
        if (std::filesystem::exists(el)) std::filesystem::remove(el);
        std::filesystem::path mutex(file.string() + "_mutex");
        if (std::filesystem::exists(mutex)) std::filesystem::remove(mutex);
        std::filesystem::remove(file);
    }
}

BOOL APIENTRY DllMain(HMODULE hModule, DWORD ul_reason_for_call, LPVOID lpReserved)
{
    switch (ul_reason_for_call)
    {
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
        break;
    case DLL_PROCESS_ATTACH:
    case DLL_PROCESS_DETACH:
        cleanup_eprosima();
        break;
    }
    return TRUE;
}
#endif // _WIN32
The function checks the directory for files. If the files have already been opened by a programme with FastDDS, they are ignored, otherwise they are deleted. As this is executed before the FastDDS code, problematic files are deleted completely.
Hi @zhangzhen5729, @baynaaMN, @OgreTransporter.
Handling application signals is the responsibility of the application, not the middleware.
> Hello, under Windows, if the program crashes unexpectedly, how can you capture the process crash or exit signal?
The following code is a slightly modified snippet of the signal handling in the Fast DDS (master) hello world example (main.cpp). It applies to Linux, macOS, and Windows:
#include <csignal>
#include <functional>
#include <iostream>
#include <string>

std::function<void(int)> stop_app_handler;

void signal_handler(
        int signum)
{
    stop_app_handler(signum);
}

int main(
        int argc,
        char** argv)
{
    // App initialization
    // ...

    // Implementation of your signal handler
    stop_app_handler = [&](int signum)
            {
                std::cout << "\nSignal #" << std::to_string(signum) << " received, stopping application." << std::endl;
                // Call application destruction methods here
                // ...
            };

    // Examples of handled signals; some of them are not supported on Windows
    signal(SIGINT, signal_handler);
    signal(SIGTERM, signal_handler);
#ifndef _WIN32
    signal(SIGQUIT, signal_handler);
    signal(SIGHUP, signal_handler);
#endif // _WIN32

    // Application loop
    // ...

    return 0;
}
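A more conservative variant of the handler above, offered as a sketch and not as part of the Fast DDS example: the lambda in that snippet calls std::function and std::cout from inside the signal handler, and the C++ standard does not guarantee that either is safe there. The version below only sets a flag that the application loop polls, so the actual shutdown (deleting the DDS entities, which lets Fast DDS release its shared memory files) runs in normal code. The names g_stop_requested, request_stop, install_stop_handlers, and stop_requested are hypothetical.

```cpp
#include <csignal>

// Hypothetical names for illustration; none of this is Fast DDS API.
namespace {
// sig_atomic_t is the integer type the standard allows a handler to write.
volatile std::sig_atomic_t g_stop_requested = 0;
} // namespace

// The handler does nothing except record that a stop was requested.
extern "C" void request_stop(int /*signum*/)
{
    g_stop_requested = 1;
}

// Registers the handler for the same signals as the snippet above.
void install_stop_handlers()
{
    std::signal(SIGINT, request_stop);
    std::signal(SIGTERM, request_stop);
#ifndef _WIN32
    std::signal(SIGQUIT, request_stop);
    std::signal(SIGHUP, request_stop);
#endif // _WIN32
}

// Polled by the application loop to decide when to shut down cleanly.
bool stop_requested()
{
    return g_stop_requested != 0;
}
```

The application loop would then run as `while (!stop_requested()) { /* publish or take samples */ }`, followed by the normal destruction of the DDS entities. This only covers terminations that are delivered as signals; a process that is killed outright never runs the handler and still leaves stale files behind.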
> why do we need to call the fastdds shm clean command separately or manually?
There is no need if the application is correctly closed. The created files are associated with the identifiers (GUIDs) of the different DDS entities, and their corresponding ports. The newly created entities will not be allowed to overwrite the previous files, even though the identifiers and ports are the same. For that reason, those files should be removed once the entity is removed (task performed if the application is correctly closed).
> If the program crashes unexpectedly and no signal is captured, and the FASTDDS resources are not released, resulting in the inability to receive data after restart, what should be done?
The application is responsible for recovering from unexpected crashes. As part of that recovery, you should clean the SHM files left behind by the unexpected shutdown, using the fastdds shm clean command, which is available on all the operating systems mentioned above (Linux, macOS, and Windows).
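One way to make that recovery step part of the application itself is to invoke the command once at startup, before any DomainParticipant is created. A minimal sketch, assuming the fastdds CLI tool is on the PATH of the process; run_shm_cleanup is a hypothetical helper, not a Fast DDS function:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: runs a cleanup command through the system command
// processor and reports whether it exited with status 0.
// The default command assumes the `fastdds` CLI tool is on PATH.
bool run_shm_cleanup(const std::string& command = "fastdds shm clean")
{
    // std::system returns an implementation-defined value; this sketch
    // treats 0 as success and anything else as failure.
    return std::system(command.c_str()) == 0;
}
```

Calling `run_shm_cleanup()` as the first statement of main and logging a warning when it returns false keeps the restart path self-contained. Whether a failed cleanup should abort startup or only be logged is a decision for the application.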
Therefore, I am moving this issue to the Support section according to the Fast DDS CONTRIBUTING guidelines.