lightningnetwork/lnd

[bug]: Interop: INVALID_ONION_HMAC

JssDWt opened this issue ยท 14 comments

Background

There is an interop issue with Core lightning that causes INVALID_ONION_HMAC errors. It consistently reproduces with the following setup:

LND sender -> LND routing node -> CLN routing node -> CLN recipient

It does NOT reproduce with

  • 4 LND nodes
  • 4 CLN nodes
  • 1 LND node and 1 CLN node
  • LND sender, LND routing node, CLN recipient
  • CLN sender, LND routing node, CLN routing node, CLN recipient

Your environment

  • version of lnd: 0.17.5
  • Seen in production on ubuntu, also locally on Mac M3
  • bitcoind 0.23
  • cln v24.02.2

Steps to reproduce

Create 4 lightning nodes with channels between them:

  • LND sender
  • LND routing node
  • CLN routing node
  • CLN recipient

Send a payment from the sender to the recipient over the two routing nodes.

Expected behaviour

The payment should succeed

Actual behaviour

The payment fails with INVALID_ONION_HMAC from index 1.

testing setup

  • Get binaries for lnd 0.17.5, cln 24.02.2, bitcoind and bitcoin-cli
  • Checkout the invalid_onion_hmac branch of https://github.com/breez/lspd
  • Review the test https://github.com/breez/lspd/blob/invalid_onion_hmac/itest/invalid_onion_hmac.go
  • Run the test with go test -timeout 5m -v ./itest -test.run TestInvalidOnionHmac --bitcoindexec /path/to/bitcoind --lightningdexec /path/to/lightningd --lndexec /path/to/lnd --bitcoincliexec /path/to/bitcoin-cli --testdir /path/to/temporary/testdir --preservelogs
  • Note that the test does not run any lspd related infrastructure, it's pure lightning nodes

logs

lnd-lnd-sender.log
lnd-lnd-router.log
lightningd-cln-router.log
lightningd-cln-recipient.log

CLN issue

ElementsProject/lightning#7347

Are you able to hit this outside of your testing infrastructure? Ran a quick test this morning with LND -- LND -- CLN -(private)- CLN on the specified versions and I'm unable to reproduce this failure.

@carlaKC this is an issue we've seen in production inconsistently with a few customers and we were to reproduce consistently in a testing environment.

Hi all, we're the ones getting the bug in production in case I can be of any help. This issue is preventing affected nodes running on Greenlight infrastructure to receive payments from Alby or Blink (only ones tested so far).

@carlaKC this is an issue we've seen in production inconsistently with a few customers and we were to reproduce consistently in a testing environment.

Very helpful to be able to consistently reproduce it! I'm just wondering what the difference is between the test setup and the quick test I ran manually - that'll be very helpful to try pin-point where the issue is.

I'm just wondering what the difference is between the test setup and the quick test I ran manually

I agree! The logs from a test are supplied at the bottom of the issue description with trace level from lnd. Maybe differences can be seen in there. If there's any other information that can be helpful, or running tests with specific settings, let me know.

have reproduced ๐Ÿ‘

@ellemouton Need a build with more data? I can have it spit out all the gory details to stderr....

We will give this error if they try to send a legacy onion, but nobody does that any more. Surely?

Indeed, LND log 2915:

   Payload: ([]uint8) (len=33 cap=64) {
    00000000  00 26 28 0f 74 eb c5 87  c8 00 00 00 00 00 00 27  |.&(.t..........'|
    00000010  10 00 00 00 88 00 00 00  00 00 00 00 00 00 00 00  |................|
    00000020  00                                                |.|

See that 00 at the front of the payload? That's legacy, which we removed support for in 23.05 ๐Ÿ˜ฌ (Wait, no, 22.11!)

Weird, we'll only try to send a legacy onion if either the hop hints sent over don't have the proper feature bits set, or we never got a node announcement from the destination node (so we don't know their feature bit set).

Ah, I think what's happening is that if CLN isn't setting the bit at all (we still set it, but have it marked as required), then we'd assume that the peer is super old and doesn't understand the new onion format.

@Roasbeef Can you elaborate which feature bits specifically?

Apparently (according to @ellemouton) the node hasn't seen the CLN node announcement, and is assuming the worst. FWIW we stopped generating them in our 2022-04 release, six months before we stopped accepting them.

You should probably just rip out the code which generates the old-style legacy onions altogether. For us it was a nice simplification...

@JssDWt the TLV onion feature bit.

What I think is happening here is CLN not setting the bit at all (or we don't have a node announcement, so we assume they don't support it). Newer lnd clients (0.18 and forward) will start to set the bit to required. Eventually we'll also just stop setting it all together, but we can't do that yet, as older clients (pre 0.18) will use the feature bit to decide if they can do things like MPP or not.

We can also just assume it's active (for 0.18), but as mentioned that won't help older clients (who still look at the bit).

CLN are setting the bit (it is set in the invoice) but the main issue is that the LND nodes in the setup are not seeing the CLN node announcement. They do get the channel announcement though & so know of the node but then assume a zero length feature bit vector. So that is the main issue.

The second issue (which I realised will only actually matter in the case where an invoice has chained hop hints) is that we (LND) assume that hop hints have zero length feature bit vectors (ie, we dont see them as having the TLV bit set and so we would use legacy encoding for these hops).