tonton81/FlexCAN_T4

isotp losing frames when multiple id's being received at once

fredmcc opened this issue · 4 comments

I've been working on this one for a while now and can't seem to solve it.

My best guess right now is that when a full isotp frame has been received and the slot that is no longer required is being deleted with _rx_slots.findRemove either the wrong slot is being deleted or something gets corrupted within circular buffer.
As you can see below once 0x7EE has been fully received 0x7EA can no longer be found in the circular buffer.
This behavior is pretty stable but sometimes a different id has the problem. Just based on the receive order I think.

So I think this is only happening when there are multiple active slots in circular buffer and a message is fully received.

As a test I commented out the line where the slot is removed after the isotp frame is complete to see if that was the cause
//_rx_slots.findRemove(data, sizeof(data), 0, 1, 2, 3, 3);
With this commented out all frames were received, but it obviously leads to other problems.

Any ideas on what to check. Circular buffer is pretty complicated.

Monitor MB:9:7EE 10 17 49 0A 01 47 50 43 mS-2329
first frame ......
Monitor MB:9:7E8 10 17 49 0A 01 45 43 4D mS-2332
first frame ......
Monitor MB:9:7EA 10 17 49 0A 01 54 43 4D mS-2342
first frame ......
Monitor MB:9:7EE 21 4D 2D 47 6C 6F 77 50 mS-2350
Consecutive frame found.....
Monitor MB:9:7E8 21 00 2D 45 6E 67 69 6E mS-2351
Consecutive frame found.....
Monitor MB:9:7EE 22 6C 75 67 43 74 72 6C mS-2359
Consecutive frame found.....
Monitor MB:9:7EA 21 00 2D 54 72 61 6E 73 mS-2359
Consecutive frame found.....
Monitor MB:9:7EE 23 00 00 00 AA AA AA AA mS-2369
Consecutive frame found.....
<-ISO_CAN : 7EE->49 0A 01 47 50 43 4D 2D 47 6C 6F 77 50 6C 75 67 43 74 72 6C 00 00 00 - 2369
Monitor MB:9:7E8 22 65 43 6F 6E 74 72 6F mS-2370
Consecutive frame found.....
Monitor MB:9:7EA 22 6D 69 73 43 74 72 6C mS-2378
Consecutive frame found.....
_rx_slots.find - Failed !!!!!!
Monitor MB:9:7E8 23 6C 00 00 AA AA AA AA mS-2390
Consecutive frame found.....
<-ISO_CAN : 7E8->49 0A 01 45 43 4D 00 2D 45 6E 67 69 6E 65 43 6F 6E 74 72 6F 6C 00 00 - 2390
Monitor MB:9:7EA 23 00 00 00 AA AA AA AA mS-2397
Consecutive frame found.....
_rx_slots.find - Failed !!!!!!

did you try to add a separation_time? demo includes 10ms for sending, this stabilizes for high traffic and concurrent frames

same for reception, you'd send send a flow control with a separation_time set

also findRemove doesn't run outside of isr context, and is removed before the usercallback is fired, it will delete on an out of sequence which is normal but thats why you can try a flow control separation_time.

i mean by the time you exit your callback, and check your mailboxes for frames, you gotta also make sure you don't miss any in that timeframe. that's why separation_time helps

here is a good example. if you are using fifo, frames are in order, with or without separation_time the isotp sequence are ordered. Now, if you have couple mailboxes receiving frames, they can be in ANY order, and if the sequence is out of order, the isotp data is aborted and removed from queue

@tonton81 Pretty sure I found the issue in circular buffer.
When .findRemove is called it finds the entry and then calls .remove(pos).
However I think it is calling it with the wrong parameter.
.remove is looking for the position like this if ( ((head+i)&(_size-1)) == pos ) {

but .findremove is sending something different.
if ( found >= 0 ) {
remove(found); Shouldn't this be remove(((head+found)&(_size-1))); When I do this it works properly
return 1;
}

So I can see when 3 message slots are in play .findRemove is finding the right slot but an error is occurring when it calls remove and the wrong slot is removed and then that message fails to be received.

FULL FUNCTIONS BELOW

template<typename T, uint16_t _size, uint16_t multi>
bool Circular_Buffer<T, _size, multi>::findRemove(T *buffer, uint16_t length, int pos1, int pos2, int pos3, int pos4, int pos5) {
uint8_t input_count = 3;
int32_t found = -1;

if ( pos4 != -1 ) input_count = 4;
if ( pos5 != -1 ) input_count = 5;
for ( uint16_t j = 0; j < _available; j++ ) {
switch ( input_count ) {
case 3: {
if ( _cabuf[ ((head+j)&(_size-1)) ][pos1+2] == buffer[pos1] && _cabuf[ ((head+j)&(_size-1)) ][pos2+2] == buffer[pos2] &&
_cabuf[ ((head+j)&(_size-1)) ][pos3+2] == buffer[pos3] ) {
found = j;
break;
}
}
case 4: {
if ( _cabuf[ ((head+j)&(_size-1)) ][pos1+2] == buffer[pos1] && _cabuf[ ((head+j)&(_size-1)) ][pos2+2] == buffer[pos2] &&
_cabuf[ ((head+j)&(_size-1)) ][pos3+2] == buffer[pos3] && _cabuf[ ((head+j)&(_size-1)) ][pos4+2] == buffer[pos4] ) {
found = j;
break;
}
}
case 5: {
if ( _cabuf[ ((head+j)&(_size-1)) ][pos1+2] == buffer[pos1] && _cabuf[ ((head+j)&(_size-1)) ][pos2+2] == buffer[pos2] &&
_cabuf[ ((head+j)&(_size-1)) ][pos3+2] == buffer[pos3] && _cabuf[ ((head+j)&(_size-1)) ][pos4+2] == buffer[pos4] &&
_cabuf[ ((head+j)&(_size-1)) ][pos5+2] == buffer[pos5] ) {
found = j;
break;
}
}
}
if ( found >= 0 ) {
remove(found);
return 1;
}
}
return 0;
}

template<typename T, uint16_t _size, uint16_t multi>
bool Circular_Buffer<T,_size,multi>::remove(uint16_t pos) {
if ( multi ) {
if ( pos >= _size ) return 0;

int32_t find_area = -1;

for ( uint16_t i = 0; i < _size; i++ ) {
  if ( ((head+i)&(_size-1)) == pos ) {
    find_area = i;
    break;
  }
}
if ( find_area == -1 ) return 0;

while ( ((head+find_area)&(_size-1)) != ((head)&(_size-1)) ) {
  memmove(_cabuf[((head+find_area)&(_size-1))],_cabuf[((head+find_area-1)&(_size-1))], (2+multi)*sizeof(T));
  find_area--;
}
head = ((head + 1)&(2*_size-1));
_available--;
return 1;

}
return 0;
}

the remove takes an indice, never seen this issue as 2 libraries use this feature, but if you can confirm it's working thats good because the only thing i use isotp for is talking to one esp32

Its pretty common in vehicles to have multiple modules respond to a singe OBD request with 0x7DF address.
In this case 3 responding at once, and that's when the issue shows itself. If the modules responded 1 after the other had finished the issue wouldn't show up.
Works perfect now with all the modules responses being correctly received every time.