P300 does not recover from "nack, comm error"

Question

s0170071 opened this issue 6 years ago · 22 comments

If that error is triggered once, e.g. by a read attempt to a wrong address, the library does not recover from it. All subsequent read requests fail with error 3 without even attempting to send anything.

Answer 1 · 2018-04-03T09:56:28.000Z

Here are some of the changes I made. Sorry for the screenshot; somehow GitHub wouldn't let me push to my own repo and create a pull request from there.

It works now even if you try to poll invalid data points. What I didn't check is what happens if a real communications error occurs.

Answer 2 · 2018-04-03T12:00:03.000Z

Before you're able to push, you have to fork. On your own fork you'll be able to push (after creating a bugfix branch?) and from there you can open a pull request to this repo.

I'll be happy to assist (although I'm absolutely no expert either!)

Answer 3 · 2018-04-03T20:27:40.000Z

I don't understand why the check _queue.empty() is needed.
There are only 3 (4) possibilities:

1. The queue is not empty and Optolink is not busy --> handle the next item
2. Optolink was successful --> process the result and remove the item from the queue
3. Optolink returned an error --> read the error and remove the item from the queue
4. (The queue is not empty and Optolink is still busy --> do nothing)

In cases 2 and 3, the queue is checked again on the next run; if it is empty, no new action is started, so optolink.available() will only return 0 unless there's something in the queue.

I'm not saying there's no bug, I just don't see the error.

Answer 4 · 2018-04-04T06:37:57.000Z

If the queue is empty you go to

if (_optolink.available() > 0) {

and the crash happens when you do _queue.front().DP->callback(value);
If the queue is empty, front() returns null? and your call to DP->callback() crashes.

void VitoWifiInterface<OptolinkP300>::loop() {
  _optolink.loop();

  if (_queue.empty()) return;  // nothing queued: _queue.front() must not be called below

  if (!_optolink.isBusy()) {  // the queue is known to be non-empty here
    if (_queue.front().write) {
      _optolink.writeToDP(_queue.front().DP->getAddress(), _queue.front().DP->getLength(), _queue.front().value);
    } else {
      _optolink.readFromDP(_queue.front().DP->getAddress(), _queue.front().DP->getLength());
    }
    return;
  }
 
  if (_optolink.available() > 0) {  // trigger callback when ready and remove element from queue
    _logger.print(F("DP "));
    _logger.print(_queue.front().DP->getName());
    _logger.println(F(" succes"));
    uint8_t value[4] = {0};
    _optolink.read(value);
    _queue.front().DP->callback(value);
    _queue.pop();
    return;
  }
  if (_optolink.available() < 0) {  // display error message and remove element from queue
    _logger.print(F("DP "));
    _logger.print(_queue.front().DP->getName());
    _logger.print(F(" error: "));
    uint8_t errorCode = _optolink.readError();
    _logger.println(errorCode, DEC);
    _queue.pop();
    return;
  }
}
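
The effect of the early return can be shown in isolation. The following is only a minimal sketch, assuming the queue behaves like a std::queue (calling front() on an empty std::queue is undefined behavior, it does not return null). FakeOptolink, Action and handleResult are hypothetical stand-ins for illustration and are not part of VitoWifi; clearing a stale result when the queue is empty is also an assumption of this sketch, not something the code above does.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Hypothetical stand-in for the Optolink driver (not the real class).
struct FakeOptolink {
  int status = 0;  // > 0: value ready, < 0: error pending, 0: nothing to report
  int available() const { return status; }
  void clear() { status = 0; }
};

// Hypothetical queue entry; the real entries hold a datapoint pointer.
struct Action {
  std::string name;
};

// Handles one pending result. Returns the log line, or "" if nothing was done.
std::string handleResult(std::queue<Action>& queue, FakeOptolink& link) {
  if (queue.empty()) {
    // A result without a queued action is stale: drop it instead of
    // dereferencing queue.front(), which would be undefined behavior.
    link.clear();
    return "";
  }
  assert(!queue.empty());  // invariant for every front() call below

  if (link.available() > 0) {  // value ready: report it and remove the action
    std::string line = "DP " + queue.front().name + " success";
    queue.pop();
    link.clear();
    return line;
  }
  if (link.available() < 0) {  // error: report it and move on to the next action
    std::string line = "DP " + queue.front().name + " error";
    queue.pop();
    link.clear();
    return line;
  }
  return "";  // still waiting for the driver
}
```

With the guard in place, a leftover error status and an empty queue result in no action at all rather than a crash.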

Answer 5 · 2018-04-04T06:49:57.000Z

True, but optolink.available() should not return a nonzero value (< 0 or > 0) when the queue is empty. The last element is only removed from the queue after the callback has been called, so on the next run the queue is empty but available() should return 0.

But it may have something to do with the buggy error handling of optolink itself.

Answer 6 · 2018-04-05T17:20:07.000Z

I'm currently testing with this branch. I removed the retries so VitoWifi just moves on to the next DP and returns the error code.

Answer 7 · 2018-04-05T17:26:27.000Z

Sounds reasonable, as this is not a recoverable communications error.
Will try, just not this week...

Answer 8 · 2018-04-05T17:46:48.000Z

No worries, I'm busy the following week with other priorities...

I actually connected my optolink very badly (a lot of "optical noise" I think) and it runs fine.
Often it skips a DP because of various errors (checksum, timeout...)

Answer 9 · 2018-04-11T18:12:39.000Z

Ah, just found the remove-retry branch. Seems to work. The nack message appears, but reading resumes.

Answer 10 · 2018-04-12T14:05:30.000Z

One thing still to do: VitoWifi doesn't recover when the connection is lost or was never established.

Answer 11 · 2018-04-15T20:32:45.000Z

Optolink still didn't recover from all errors, but I think I got it covered in the latest commit. (033b09f)

Answer 12 · 2018-04-16T16:55:15.000Z

My setup has been running for 20+ hours now with a poorly connected optolink (= lots of errors), and it is running smoothly.
The only possible issue is that the queue could grow when the optolink is not connected and the refresh interval < (timeout * number of DPs). But I would solve that by setting a max size for the queue and rejecting requests when it is full.
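
A size-limited queue of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: BoundedQueue and tryPush are hypothetical names, not part of VitoWifi, and the limit is chosen by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical wrapper around std::queue that rejects new items when full.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t maxSize) : _maxSize(maxSize) {
    assert(maxSize > 0);  // a zero-sized queue could never accept a request
  }

  // Returns false (and leaves the queue unchanged) when the limit is reached.
  bool tryPush(const T& item) {
    if (_queue.size() >= _maxSize) return false;
    _queue.push(item);
    return true;
  }

  bool empty() const { return _queue.empty(); }
  std::size_t size() const { return _queue.size(); }
  T& front() { return _queue.front(); }  // caller must check empty() first
  void pop() { _queue.pop(); }

 private:
  std::size_t _maxSize;
  std::queue<T> _queue;
};
```

Rejecting new requests (instead of dropping the oldest ones) keeps the actions that are already waiting in order; the caller sees the false return value and can simply try again on the next refresh.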

Answer 13 · 2018-04-16T16:57:20.000Z

Will test it next weekend.
Did you also consider misconfiguration?

Answer 14 · 2018-04-16T17:16:53.000Z

I get checksum errors, fault code returns, length errors and timeouts (on connection, data and acks).

Answer 15 · 2018-04-17T05:58:30.000Z

Sounds great.

Answer 16 · 2018-04-23T08:38:20.000Z

Did you have a chance to test the new version (branch)?

Answer 17 · 2018-04-23T10:33:24.000Z

Not yet. There was too much sunshine over the weekend.

Answer 18 · 2018-04-23T10:48:31.000Z

😄 I totally understand. I was busy firing up the BBQ myself.

Answer 19 · 2018-04-23T15:33:08.000Z

Just tested; it works as far as I can see. Now the COP change is missing, but that's another branch.

Answer 20 · 2018-04-23T16:11:37.000Z

OK, then I'll merge. Should there be a bug, feel free to create a new issue!

BTW, I'm completely reworking the DP management. And I'll add a "RAW" datapoint to be able to probe unknown addresses or sizes.

Answer 21 · 2018-04-26T18:24:48.000Z

Looking forward to that raw point. Some thoughts on that:

- It should use the number of transmitted bytes as advertised by the protocol.
- It should do all sorts of meaningful conversions (e.g. 4 data bytes may be an int32 or a uint32) and put them into a struct.
- Log output: show all converted numbers. Usually a human can spot what's meaningful and what's not.
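
To make the conversion idea concrete, here is a rough sketch of what such a struct could look like. Everything in it is an assumption for illustration: RawInterpretation and interpretRaw are hypothetical names, the payload is limited to 1 to 4 bytes, and the bytes are assumed to arrive least significant byte first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type: the same raw bytes seen through several integer types.
struct RawInterpretation {
  std::size_t length;  // number of bytes actually received
  uint8_t u8;
  int8_t i8;
  uint16_t u16;
  int16_t i16;
  uint32_t u32;
  int32_t i32;
};

// Interprets 1 to 4 payload bytes, assuming little-endian byte order.
RawInterpretation interpretRaw(const uint8_t* data, std::size_t length) {
  assert(data != nullptr && length >= 1 && length <= 4);

  // Assemble the bytes into one 32-bit value; missing high bytes stay zero.
  uint32_t raw = 0;
  for (std::size_t i = 0; i < length; ++i) {
    raw |= static_cast<uint32_t>(data[i]) << (8 * i);
  }

  RawInterpretation r{};
  r.length = length;
  r.u8 = static_cast<uint8_t>(raw & 0xFF);
  r.i8 = static_cast<int8_t>(r.u8);    // two's complement reinterpretation
  r.u16 = static_cast<uint16_t>(raw & 0xFFFF);
  r.i16 = static_cast<int16_t>(r.u16);
  r.u32 = raw;
  r.i32 = static_cast<int32_t>(raw);
  return r;
}
```

A logger could then print every field of the struct for a probed address and leave it to the reader to pick the interpretation that makes sense. Fields wider than the received length only contain zero-extended data, which is why the struct also carries the length.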

Answer 22 · 2018-04-26T19:30:24.000Z

Let us continue that discussion in #10