NMEA0183 serial input problem: 20231105
norbert-walter opened this issue · 13 comments
If I feed an NMEA0183 serial data stream into an M5Stack Atom via an RS485 unit, problems arise when the data stream is interrupted (cable unplugged, transmitting device reset). After the transfer is interrupted and then resumes, the telegram counter for the serial port freezes. However, telegrams forwarded via USB are still displayed. The web page also no longer renders properly when it is refreshed.
The same problem also occurs when incomplete, garbled or unknown telegrams are transmitted; in that case the M5Stack Atom freezes completely. To reproduce this, I intentionally short-circuited the transmission lines. For example, the new GPS receiver sends the following telegrams immediately after it has started:
$GPTXT,01,01,02,HW=ATGM336H,00030107486321C
$GPTXT,01,01,02,IC=AT6558-5N-31-0C510800,BMLLCKJ-D2-0374165A
$GPTXT,01,01,02,SW=URANUS5,V5.3.0.01D
$GPTXT,01,01,02,TB=2020-04-28,13:43:1040
$GPTXT,01,01,02,MO=GB77
$GPTXT,01,01,02,BS=SOC_BootLoader,V6.2.0.234
$GPTXT,01,01,02,FI=00856014*71
A bit hard to understand what exactly you did and what happened.
Let's discuss this directly...
Hi,
I have noticed something similar:
the NMEA2000 connection freezes after connecting a serial input (AIS signal).
Did you find a reason for the problem described in the question?
Dick Koning
No, not really.
Maybe you can capture the log from the USB port,
optionally after increasing the log level.
Did you test this with the newest version?
Any chance for some logs?
Hi Andreas,
I had exactly the same problem when NMEA sentences were transmitted incorrectly, i.e. the checksum did not match the content. In my case, a faulty ground connection garbled the telegrams. To reproduce it, you can take an NMEA0183 log file and inject errors into the recorded telegrams, such as inserting special characters or spaces, or deleting parts of the telegram or the checksum. The examples in my first post show such telegrams: some have no checksum, others are unknown telegrams. In my opinion, the checksum is not verified to confirm that a telegram was transmitted correctly. Furthermore, there is no check whether a telegram is one that the interpreter can translate correctly. This is exactly what causes the software to freeze: error handling is not working properly. Occasional faulty telegrams are not a problem, but many consecutive ones are.
Norbert
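The two checks Norbert describes (is the telegram complete, and does its checksum match) could look roughly like this. This is a minimal Python sketch, not the gateway's code: it assumes the standard NMEA0183 framing of a '$' or '!' start character, a body, '*', and two hex digits holding the XOR of all body characters. The names `nmea_checksum` and `is_valid_sentence` are hypothetical, and checking whether a sentence type is one the converter understands is left out.

```python
import re

# Hypothetical helpers, not taken from the gateway source.
# Frame: '$' or '!', a body without '$', '!' or '*', then '*' and two hex digits.
_FRAME = re.compile(r"[$!]([^$!*\r\n]+)\*([0-9A-Fa-f]{2})")


def nmea_checksum(body: str) -> int:
    """XOR of all characters between the start character and '*'."""
    cs = 0
    for ch in body:
        cs ^= ord(ch)
    return cs


def is_valid_sentence(line: str) -> bool:
    """Accept only a complete sentence whose checksum matches its body."""
    m = _FRAME.fullmatch(line.strip())
    if m is None:
        # Missing start character, missing '*', or malformed checksum field.
        return False
    body, checksum = m.groups()
    return nmea_checksum(body) == int(checksum, 16)
```

With this check, telegrams from the first post such as `$GPTXT,01,01,02,MO=GB77` are rejected because they contain no '*' checksum delimiter, so they could be dropped (and counted as errors) before reaching the interpreter.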
After some additional analysis I found basically two problems:
(1) When the serial RX buffers overrun, the NMEA messages can be garbled. This can lead to invalid serial counter names (e.g. containing a " character). If this occurs, the Web UI looks broken (because the status update internally raises a JS error). The device itself still runs without issues.
So one correction will be to prevent this error in the Web UI.
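One way to keep a garbled name from breaking the status page is sketched below in Python. It assumes that legitimate counter names are plain talker/sentence identifiers made of A-Z and 0-9 (an assumption, not the gateway's actual rule), folds everything else into a hypothetical lowercase `invalid` bucket, and serializes with a real JSON encoder so that a stray `"` is escaped instead of terminating a string. The function names and the `serial` key are illustrative only.

```python
import json
import re
from collections import Counter

# Assumption: valid counter names (e.g. "GPRMC", or proprietary ones such as
# "PGRME") only use A-Z and 0-9. Everything else is treated as garbled.
_NAME_OK = re.compile(r"[A-Z0-9]{1,10}")
INVALID_KEY = "invalid"  # hypothetical bucket; lowercase so it cannot collide


def count_sentence(counters: Counter, name: str) -> None:
    """Count a received sentence, folding garbled names into one bucket."""
    counters[name if _NAME_OK.fullmatch(name) else INVALID_KEY] += 1


def status_json(counters: Counter) -> str:
    """Serialize counters with a JSON encoder so quotes are escaped."""
    return json.dumps({"serial": dict(counters)}, sort_keys=True)
```

Either measure alone would stop the JS error: filtering keeps the counter list readable, while proper escaping makes the UI robust even against names the filter lets through.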
(2) What causes the RX FIFOs/buffers to overflow? This typically happens if you enable debug output. As every invalid line produces at least 2 lines of output, this can easily slow the system down considerably. The main loop flushes the log to the USB device. When running with the default of 115200 baud, a continuous 38400 baud input with the example data from this issue already pushes the USB channel to its limit. You can see this in the Main loop line.
If you count the log bytes between two Main loop lines and compare them with the time difference, you can easily see that the log output saturates the 115200 baud channel completely.
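The saturation argument can be checked with a back-of-the-envelope calculation. The Python sketch below assumes 8N1 framing (10 bits on the wire per byte) and, as hypothetical inputs, 40-byte sentences, 2 log lines per invalid sentence, and a 25-byte prefix per log line, with each log line echoing the sentence. None of these sizes are measured values from the thread.

```python
# Back-of-the-envelope estimate; line length, log-line count and prefix
# overhead below are hypothetical inputs, not measured values.

def uart_bytes_per_second(baud: int, bits_per_char: int = 10) -> float:
    """Usable byte rate of a UART; 8N1 framing costs 10 bits per byte."""
    return baud / bits_per_char


def debug_log_load(input_baud: int, line_len: int, log_lines_per_input: int,
                   log_prefix_len: int, usb_baud: int) -> float:
    """Fraction of the USB log channel consumed by per-sentence debug output.

    Assumes every input sentence is echoed in each of its log lines,
    plus a fixed prefix (timestamp, level, ...) per log line.
    """
    sentences_per_s = uart_bytes_per_second(input_baud) / line_len
    log_bytes_per_s = sentences_per_s * log_lines_per_input * (line_len + log_prefix_len)
    return log_bytes_per_s / uart_bytes_per_second(usb_baud)


print(f"{debug_log_load(38400, 40, 2, 25, 115200):.0%}")  # prints 108%
```

With these made-up sizes the debug output needs about 12480 bytes/s, more than the 11520 bytes/s a 115200 baud link can carry. Even with no prefix at all, echoing each sentence twice would already use two thirds of the channel.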
good (idle):
Main loop 2886.00/s334.82[1542us]#1:30.35[69],2:13.14[21],3:3.44[979],4:3.52[223],5:145.97[299],6:53.15[121],7:32.12[238],8:10.72[24],9:26.04[58],10:12.99[28],11:4.00[16],
bad (38400 baud input with invalid messages):
Main loop 60.76/s14706.50[38544us]#1:13634.41[37054],2:29.25[57],3:3.46[1243],4:3.26[11],5:373.14[717],6:154.02[381],7:118.10[446],8:52.56[100],9:1064.08[1868],10:22.97[83],11:16.09[4783],
The first line (idle) shows 2886 main loop iterations per second, the second one only ~60! You can also see that phase 1 (flushing the logs) already takes ~13.6 ms (average) of the ~14.7 ms main loop run time.
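Reading these lines by hand is error-prone, so a small parser helps. This Python sketch assumes the layout `Main loop <loops>/s<avg_us>[<n>us]#<phase>:<avg_us>[<n>],...`, which is inferred from the two sample lines and the reading above (average loop time after `/s`, average phase times after each `:`). The bracketed values are ignored, and the function names are hypothetical.

```python
import re

# Assumed layout, inferred from the sample lines (not from the gateway source):
#   Main loop <loops>/s<avg_us>[<n>us]#<phase>:<avg_us>[<n>],...
_HEAD = re.compile(r"Main loop ([\d.]+)/s([\d.]+)\[(\d+)us\]#(.*)")
_PHASE = re.compile(r"(\d+):([\d.]+)\[(\d+)\]")


def parse_main_loop(line: str) -> dict:
    """Extract loops/s, average loop time and average time per phase (us)."""
    m = _HEAD.search(line)
    if m is None:
        raise ValueError("not a Main loop line")
    # Bracketed values (presumably maxima) are parsed but not kept.
    phases = {int(p): float(avg) for p, avg, _ in _PHASE.findall(m.group(4))}
    return {"loops_per_s": float(m.group(1)),
            "avg_loop_us": float(m.group(2)),
            "phases_avg_us": phases}


def phase_share(stats: dict, phase: int) -> float:
    """Fraction of the average loop time spent in one phase."""
    return stats["phases_avg_us"][phase] / stats["avg_loop_us"]
```

Applied to the two lines above, phase 1 accounts for about 9% of the loop when idle (30.35 of 334.82 us) but about 93% with the 38400 baud input (13634.41 of 14706.50 us).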
If you switch the log level to "log" the problem should go away.
In the fix I will add an error log entry when the RX FIFO or the RX buffer overflows.
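The thread does not show how the overflow error log is implemented. A minimal Python sketch of the idea follows; it models a single bounded buffer (on the device the RX FIFO and the RX buffer are separate stages), and the class `RxBuffer`, the logger name `serial`, and the once-per-episode logging rule are all hypothetical.

```python
import logging
from collections import deque

log = logging.getLogger("serial")


class RxBuffer:
    """Hypothetical fixed-size RX buffer that reports overflows.

    Sketch only: bytes that do not fit are dropped and counted, and one
    error is logged per overflow episode instead of one per lost chunk.
    """

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data = deque()
        self.dropped = 0
        self._overflowing = False

    def push(self, chunk: bytes) -> None:
        free = self.capacity - len(self.data)
        self.data.extend(chunk[:free])
        lost = max(0, len(chunk) - free)
        if lost:
            self.dropped += lost
            if not self._overflowing:
                # Log only once per episode so the error message itself cannot
                # flood the already saturated log channel.
                log.error("RX buffer overflow: %d bytes dropped", lost)
                self._overflowing = True

    def read(self, n: int) -> bytes:
        count = min(n, len(self.data))
        out = bytes(self.data.popleft() for _ in range(count))
        self._overflowing = False  # consumer is reading again: re-arm the log
        return out
```

Logging once per episode rather than once per dropped chunk is a deliberate choice here: given problem (2), an error message for every lost chunk would add to the very log traffic that causes the overflow.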
Ahh... nice that you found a problem.
Any log level lower than debug should solve the issue.
Thanks for your work!