fledge-iot/fledge

Crash in DataLoad::readBlock()

Closed this issue · 1 comments

Context

In a north plugin, the thread loading data from the database can crash the whole north plugin if there is an error during the database read.

Here is an example of logs that can be found when it happens:

Fledge iec104north_c1[3883]: ERROR: Failed to fetch readings: End of file
Fledge iec104north_c1[3883]: DEBUG: DataLoad::readBlock(): Got 1649533134 readings from storage client
Fledge iec104north_c1[3883]: FATAL: Signal 11 (Segmentation fault) trapped:
Fledge iec104north_c1[3883]: FATAL: (0) 0 0x55eae96cc4b2 handler(int) + 82---------
Fledge iec104north_c1[3883]: FATAL: (1) 1 0x7ff636a3d090 /lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7ff636a3d090]---------
Fledge iec104north_c1[3883]: FATAL: (2) 2 0x7ff6341cf5da IEC104PivotFilter::ingest(ReadingSet*) + 234---------
Fledge iec104north_c1[3883]: FATAL: (3) 3 0x55eae96c28fc DataLoad::bufferReadings(ReadingSet*) + 348---------
Fledge iec104north_c1[3883]: FATAL: (4) 4 0x55eae96c5028 DataLoad::readBlock(unsigned int) + 728---------
Fledge iec104north_c1[3883]: FATAL: (5) 5 0x55eae96c5222 ./fledge.services.north(+0x3b222) [0x55eae96c5222]---------
Fledge iec104north_c1[3883]: FATAL: (6) 6 0x7ff636cdfdf4 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7ff636cdfdf4]---------
Fledge iec104north_c1[3883]: FATAL: (7) 7 0x7ff637258609 /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff637258609]---------
Fledge iec104north_c1[3883]: FATAL: (8) 8 0x7ff636b19133 clone + 67---------

As we can see here some variable containing invalid data is being read: There is no way the plugin had 1649533134 reading stored somewhere.

Steps to reproduce

  • I am not sure how this "Failed to fetch readings" log happened in my case, maybe it was due to some process being killed during a Fledge restart? Anyway if you want to make sure to enter the same course of action, you may want to edit DataLoad::readBlock() function and make sure an exception occur while the readings variable is being filled, so that this varialbe is no longer null but contains garbage data.
  • Start fledge with a north plugin and the modified north service from above.
  • Wait for Fledge to attempt data collection in the noth plugin.
  • Notice the crash.

Expected behavior

The north plugin does not crash.

Code analysis

So far it appears that exceptions happening during the phase where readings is filled are simply ignored and the content of the variable is processed anyway even if something went wrong and it may contain corrupted data:

void DataLoad::readBlock(unsigned int blockSize)
{
ReadingSet *readings = NULL;
int n_waits = 0;

	do
	{
		try
		{
			switch (m_dataSource)
			{
				case SourceReadings:
					// Logger::getLogger()->debug("Fetch %d readings from %d", blockSize, m_lastFetched + 1);
					readings = m_storage->readingFetch(m_lastFetched + 1, blockSize);
					break;
				case SourceStatistics:
					readings = fetchStatistics(blockSize);
					break;
				case SourceAudit:
					readings = fetchAudit(blockSize);
					break;
				default:
					Logger::getLogger()->fatal("Bad source for data to send");
					break;

			}
		}
	       	catch (ReadingSetException* e)
		{
			// Ignore, the exception has been reported in the layer below
		}
	       	catch (exception& e)
		{
			// Ignore, the exception has been reported in the layer below
		}
		if (readings && readings->getCount())
		{
			Logger::getLogger()->debug("DataLoad::readBlock(): Got %d readings from storage client", readings->getCount());
			m_lastFetched = readings->getLastId();
			bufferReadings(readings);
			if (m_perfMonitor)
			{
				m_perfMonitor->collect("No of waits for data", n_waits);
				m_perfMonitor->collect("Block utilisation %", (readings->getCount() * 100) / blockSize);
			}
			return;
		}
		else if (readings)
		{
			// Delete the empty readings set
			delete readings;
		}
		else
		{
			// Logger::getLogger()->debug("DataLoad::readBlock(): No readings available");
		}
		if (!m_shutdown)
		{	
			// TODO improve this
			this_thread::sleep_for(chrono::milliseconds(250));
			n_waits++;
		}
	} while (m_shutdown == false);
}

Suggested correction

Ensure that readings is empty when something went wrong and also ensure that readings cannot contain data from previous loop:

void DataLoad::readBlock(unsigned int blockSize)
{
int n_waits = 0;

	do
	{
		ReadingSet *readings = nullptr;
                try
		{
			switch (m_dataSource)
			{
				case SourceReadings:
					// Logger::getLogger()->debug("Fetch %d readings from %d", blockSize, m_lastFetched + 1);
					readings = m_storage->readingFetch(m_lastFetched + 1, blockSize);
					break;
				case SourceStatistics:
					readings = fetchStatistics(blockSize);
					break;
				case SourceAudit:
					readings = fetchAudit(blockSize);
					break;
				default:
					Logger::getLogger()->fatal("Bad source for data to send");
					break;

			}
		}
	       	catch (ReadingSetException* e)
		{
			// Ignore, the exception has been reported in the layer below
                        // readings may contain erroneous data, clear it
                        readings = nullptr;
		}
	       	catch (exception& e)
		{
			// Ignore, the exception has been reported in the layer below
                        // readings may contain erroneous data, clear it
                       readings = nullptr;
		}
		if (readings && readings->getCount())
		{
			Logger::getLogger()->debug("DataLoad::readBlock(): Got %d readings from storage client", readings->getCount());
			m_lastFetched = readings->getLastId();
			bufferReadings(readings);
			if (m_perfMonitor)
			{
				m_perfMonitor->collect("No of waits for data", n_waits);
				m_perfMonitor->collect("Block utilisation %", (readings->getCount() * 100) / blockSize);
			}
			return;
		}
		else if (readings)
		{
			// Delete the empty readings set
			delete readings;
		}
		else
		{
			// Logger::getLogger()->debug("DataLoad::readBlock(): No readings available");
		}
		if (!m_shutdown)
		{	
			// TODO improve this
			this_thread::sleep_for(chrono::milliseconds(250));
			n_waits++;
		}
	} while (m_shutdown == false);
}

Test OK, issue can closed