bertmelis/VitoWiFi

P300 does not recover from "nack, comm error"

s0170071 opened this issue · 22 comments

if that error was triggered once, e.g. by an read attempt to a wrong address, it does not recover from it. All consecutive read requests quit with error 3 without even attempting to send something.

here are some of the changes I made. Sorry for the screenshot, somehow github wouldn't let me push to my own repo and create a pull from there.

image

image

It works now even if you try to poll invalid data points. What I didn't check is what happens if a real communications error occurs.

Before you're able to push, you have to fork. On your own fork you'll be able to push (after creating a bugfix branch?) and from there you can open a pull request to this repo.

I'll be happy to assist (although I'm absolutely no expert either!)

I don't understand why the check _queue.empty() is needed.
There are only 3 (4) possibilities:

  1. queue is not empty and Optolink is not busy --> handle next
  2. Optolink was succesfull --> process and remove from queue
  3. Optolink returned an error --> read error and remove from queue
  4. (queue is not empty and optolink is still busy --> do nothing)
    In case 2 and 3, the queue will be checked in the next run and if empty, no new action will be processed hence optolinik.available() will only return 0 unless there's something in the queue.

I'm not saying there's no bug, I just don't see the error.

If the queue is empty you go to

if (_optolink.available() > 0) {

and the crash happens when you do _queue.front().DP->callback(value);
If the queue is empty, front() returns null? and your call to DP->callback() crashes.

void VitoWifiInterface<OptolinkP300>::loop() {
  _optolink.loop();

  if (_queue.empty()) return;
  
  if (!_queue.empty() && !_optolink.isBusy()) {
    if (_queue.front().write) {
      _optolink.writeToDP(_queue.front().DP->getAddress(), _queue.front().DP->getLength(), _queue.front().value);
    } else {
      _optolink.readFromDP(_queue.front().DP->getAddress(), _queue.front().DP->getLength());
    }
    return;
  }
 
  if (_optolink.available() > 0) {  // trigger callback when ready and remove element from queue
    _logger.print(F("DP "));
    _logger.print(_queue.front().DP->getName());
    _logger.println(F(" succes"));
    uint8_t value[4] = {0};
    _optolink.read(value);
    _queue.front().DP->callback(value);
    _queue.pop();
    return;
  }
  if (_optolink.available() < 0) {  // display error message and remove element from queue
    _logger.print(F("DP "));
    _logger.print(_queue.front().DP->getName());
    _logger.print(F(" error: "));
    uint8_t errorCode = _optolink.readError();
    _logger.println(errorCode, DEC);
    _queue.pop();
    return;
  }
}

True, but optolink should not be available (< or >) when the queue is empty. The last element from the queue is only removed after calling the callback. And in the next run, the queue is empty but optolink should return 0.

But it may have something to do with the buggy error handling of optolink itself.

I'm currently testing with this branch. I removed the retries so VitoWifi just moves on to the next DP and returns the error code.

Sounds reasonable as this is no recoverable communications error.
Will try, just not this week...

No worries, I'm busy the following week with other priorities...

I actually connected my optolink very badly (a lot of "optical noise" I think) and it runs fine.
Often it skips a DP because of various errors (checksum, timeout...)

Ah, just found the remove-retry branch. Seems to work. nack message appears but reading resumes.

One thing to do: VitoWifi doesn't recover when the connection is lost/never has been made.

Optolink still didn't recover from all errors, but I think I got it covered in the latest commit. (033b09f)

My setup is running for 20+ hours now with a poorly connected optolink (= lot's of errors) but is running smoothly.
Only possible issue is that the queue could grow when the optolink is not connected and refresh rate < ( timout * number of DPs ). But I would solve that by setting a max size of the queue and rejecting requests when full.

Will test it next weekend.
Did you also consider misconfiguration?

I get checksum errors, fault code returns, lenght errors and timeouts (on connection, data and acks).

Sounds great.

Did you had any chance to test the new version (branch).

Not yet. There was too much sunshine over the weekend.

😄 I totally understand. I was busy firing up the BBQ myself.

just tested, works as far as I can see. Now the COP change is missing, but that’s another branch.

OK, then I'll merge. Should there be a bug, feel free to create a new issue!

BTW, I'm completely reworking the DP management. And I'll add a "RAW" datapoint to be able to probe unknown addresses or sizes.

Looking forward to that raw point. Some thoughts on that:

  1. it should use the number of transmitted bytes as advertised by the protocol
  2. it should do all sorts of meaningful conversions (e.g. 4 data bytes - may be an int32 or uint32) and put them into a struct.
  3. Log output: showing all converted numbers. Usually a human can spot whats meaningful and what not.

Let us continue that discussion in #10