Streampunk/ledger

Health check states and unplug the network test.

The node API and its connection to the registry must handle the case where the network cable is unplugged and then plugged back in again, re-registering all of its resources with the registry.

This behavior must match that specified in the NMOS registration and discovery documentation on Error Conditions.

Starting an investigation into the compliance of the mdns-js library to the DNS-SD specification.

mdns-js advertisements and browsers end up in a stale state after stops and restarts. An attempt to resolve this issue can be found in pull request mdns-js/node-mdns-js#56 ... but I don't think this pull request is complete, and it is failing CI tests.

A number of stability fixes have been added that improve matters.

  • Shutting down any of the APIs with Ctrl-C now causes a remove message to be sent out.
  • Stopping and restarting a registry process will cause the NodeAPI to disconnect and then reconnect once a registry is available, provided there are no network connection issues.
  • Disconnecting the network to the registry causes the node's HTTP request to time out.
    • If the network is back before the timeout, all is OK.
    • If the timeout fires, the internally restarted mDNS browser does not find the registration service. dns-sd does see remove and add events, but mdns-js does not send an update event.
  • Disconnecting the network to the node causes a more instantaneous connection failure. Although dns-sd gets remove and add events, mdns-js is not sending an update event.

Further investigation of the behavior of mdns-js in the event of a network outage is required. I am comparing it with the behavior of dns-sd via Wireshark to see if I can understand the difference.

Some more notes to help me understand what is going on. As ever, two issues are conflicting with one another:

  1. When the network cable is disconnected and reconnected, the mdns-js browser does not send out a new aggregated query for the services it is interested in, and so the advertisement does not respond. This is different from the Mac mDNS daemon, which works with dns-sd to maintain a list of current service types of interest.
  2. If dns-sd is running dns-sd -B _nmos-registration._tcp, on a network cable disconnect and reconnect cycle an aggregated mDNS query is sent out including _nmos-registration._tcp. The mdns-js browser receives this under the bonnet but decides that it is not new, so does not send an update event to interested subscribers.

The first issue should be covered by the browser reset that occurs in ledger at the HTTP health check level. This creates a brand new browser that will send out new queries.
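
As a rough illustration of that reset, here is a minimal sketch. The names RegistryWatcher, createBrowser and healthCheckResult are hypothetical and are not taken from ledger; createBrowser is an injected factory standing in for whatever builds the mDNS browser, and the browser is only assumed to offer on('update', ...) and stop().

```javascript
// Sketch only: not the ledger implementation. `createBrowser` is a
// hypothetical injected factory that returns a browser-like object
// exposing on(event, handler) and stop().
class RegistryWatcher {
  constructor(createBrowser, onRegistryFound) {
    this.createBrowser = createBrowser;
    this.onRegistryFound = onRegistryFound;
    this.browser = null;
    this.resets = 0;
    this.startBrowser();
  }

  startBrowser() {
    const browser = this.createBrowser();
    browser.on('update', (service) => {
      // Ignore events from a browser that has since been replaced.
      if (browser === this.browser) this.onRegistryFound(service);
    });
    this.browser = browser;
  }

  // Call with the outcome of each HTTP health check against the registry.
  healthCheckResult(ok) {
    if (ok) return;
    // The health check failed or timed out: discard the old browser so
    // that a brand new one sends out new queries.
    this.browser.stop();
    this.resets += 1;
    this.startBrowser();
  }
}
```

The stale-browser guard in the update handler is a design choice for the sketch: a late event from the discarded browser should not be mistaken for a fresh discovery.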

The connections database, which holds the state used to determine isNew, lives down in the shared networking object, so it survives a browser reset. I expect it contains some stale state, and that this stale state needs to be deleted from existing connections on startup. Hmmm ... some more checking to do ....
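
To make the suspected failure mode concrete, here is a toy model of it. ServiceCache is a hypothetical stand-in and not the mdns-js internals, and the address is made up for illustration. It shows how state that outlives a browser can suppress the update after a reconnect, and how clearing it restores the event.

```javascript
// Sketch only: a toy model of an isNew check backed by shared state.
// `ServiceCache` and its methods are hypothetical names.
class ServiceCache {
  constructor() {
    this.known = new Map(); // service name -> last seen address
  }

  // Returns true when the announcement should produce an update event.
  isNew(name, address) {
    if (this.known.get(name) === address) return false;
    this.known.set(name, address);
    return true;
  }

  // Forget everything learned so far, e.g. when a browser is restarted.
  clear() {
    this.known.clear();
  }
}

const cache = new ServiceCache();
// First sighting of the registration service: reported.
const first = cache.isNew('_nmos-registration._tcp', '192.168.0.10');
// After an unplug/replug cycle the same announcement arrives again and
// is suppressed, because the shared state still remembers it.
const replay = cache.isNew('_nmos-registration._tcp', '192.168.0.10');
// Deleting the stale state makes the announcement count as new again.
cache.clear();
const afterReset = cache.isNew('_nmos-registration._tcp', '192.168.0.10');
console.log(first, replay, afterReset); // prints: true false true
```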

Published a new version that recovers in all circumstances that I can simulate in my lab, including:

  • process failure and restart of the registration service;
  • removal and return of the network connection to the registration service;
  • removal and return of the network cable to the system hosting the node.

A forked version of the mdns-js library has been created that produces updates when the browser has recovered - see nmos-mdns-js. Remember to run npm update in the ledger folder to pull in this library.

The fixed version is v1.0.15 and is published to npm. Please update references in dependent projects to use this version or later.

Please reopen this issue if you find any problems.