lightningnetwork/lnd

Pathfinding: take channel capacity into account

Closed this issue · 30 comments

In c-lightning (https://github.com/ElementsProject/lightning/pull/4771/files) and rust-lightning (lightningdevkit/rust-lightning#1166), moves have been / are being made to factor in the channel capacity relative to the payment amount.

I think that makes a lot of sense and is easily added to lnd's probability estimator. The logic could be applied in the probability estimation for channels that haven't been tried before.

Note that especially the channels which have been tried before can switch to conditional probabilities, as described in https://arxiv.org/abs/2103.08576 and, more specifically, with the concrete formulas in https://arxiv.org/abs/2107.05322

C-lightning currently ignores the uncertainty network, as they would have had to change their logic, and mission control does not do it according to the capacity model which I suggest creating here. Of course, your temporal decay of forgetting the information could persist.

I think @stefanwouldgo has code in Scala that does it for lnd (even with tracking of in-flight HTLCs). Also, I think @Roasbeef has seen part of the code in the repo mentioned in the second paper.

Is there a centralization effect on demanding excess capital for a route?

Note that especially the channels which have been tried before can switch to conditional probabilities as described in

We already do this and have for a few years now. Though our approach can certainly be improved: as is, the system will continually update these conditional probabilities and also decay them over time to reflect the growing uncertainty of the network since our last observation. Adding capacities to the mix should be a pretty straightforward change code-wise. We'd likely add some knobs there as well to control the importance/inclusion of the new factor to make it easier to experiment with.
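As a rough illustration of such a knob, here is a hypothetical Go sketch of blending a capacity-dependent factor into the a-priori probability with a tunable weight; the names and the blending rule are illustrative, not lnd's actual options:

```go
package sketch

// aprioriWithCapacity is a hypothetical sketch of blending a
// capacity-dependent factor into the a-priori success probability with a
// tunable weight knob. Names are illustrative, not lnd's actual config.
func aprioriWithCapacity(apriori float64, amt, capacity uint64,
	capacityWeight float64) float64 {

	// Fraction of the channel capacity the payment would consume.
	frac := float64(amt) / float64(capacity)

	// Penalize amounts that are a large fraction of the capacity; the
	// weight knob controls how strongly the factor is applied.
	factor := 1 - capacityWeight*frac
	if factor < 0 {
		factor = 0
	}

	return apriori * factor
}
```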

Is there a centralization effect on demanding excess capital for a route?

That's something we'd need to examine and weigh the tradeoffs for. Hopefully we'll have an initial version of the distributed A/B testing framework we're working on, so we can properly test changes like this (we've been sort of holding off to make sure our decisions are grounded in empirical evidence). Heuristically, one would say that if an amount is much smaller than the channel size, then it should have a high success probability. However, as you're getting at, we'd need to make sure individuals with very large channels don't become a sort of attempt/retry sink.

I think @stefanwouldgo has code in Scala that does it for lnd (even with tracking of in-flight HTLCs). Also, I think @Roasbeef has seen part of the code in the repo mentioned in the second paper.

This is another (imo) important thing that we don't currently do: in-flight HTLCs in the same payment session don't propagate any information back to the main mission control instance. This ends up triggering weird scenarios where we'll continue to attempt payment amounts that we know can't work (for example, if a 1 BTC channel has a 0.9 BTC HTLC present and we try to send more than 0.1 BTC through it).
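To make that concrete, a hypothetical Go sketch of per-session bookkeeping that would avoid such retries (the names are illustrative, not lnd's actual mission control structures):

```go
package sketch

// inFlightTracker is a hypothetical sketch of per-session bookkeeping that
// would let a payment avoid retrying amounts that its own in-flight HTLCs
// already make impossible (not lnd's actual mission control structures).
type inFlightTracker struct {
	// inFlight maps a channel ID to the sum of in-flight HTLC amounts
	// (in sat) of the current payment session.
	inFlight map[uint64]uint64
}

// availableCapacity subtracts what this session already has locked up, so
// that a 1 BTC channel with a 0.9 BTC HTLC in flight is only considered
// for amounts up to 0.1 BTC.
func (t *inFlightTracker) availableCapacity(chanID, capacity uint64) uint64 {
	used := t.inFlight[chanID]
	if used >= capacity {
		return 0
	}
	return capacity - used
}
```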

xraid commented

If a channel's capacity is 1, any attempt above 0.5 will have a high chance of failing, since in practice a channel's bandwidth will most of the time only have a portion available as liquidity. A retry rule with a different expectation could apply if everything else fails.

Case: for a 0.1 payment, try paths > 0.4 if the path is not flagged in MC, which equals a pragmatic 25% of the channel.

Is there a centralization effect on demanding excess capital for a route?

Agreeing with @xraid. Probably it is sufficient to only lower the probability for amounts that are a considerable part of the channel capacity. I don't think it needs to be taken to extremes where a 1 BTC channel is selected for a 10 sat payment over a 10M channel.

Strongly disagree with @joostjager: there is a whole theory and strong empirical evidence why -log((cap+1-amt)/(cap+1)) is a good feature in pathfinding, and also on how to adapt this formula after successful and failed attempts and with HTLCs logged in. Mainly two questions remain:

  1. How fast to decay gained information back to the fully uncertain case?
  2. How strongly to incorporate it with other features?

Plot this formula for a 10M channel (in the discussion of the c-lightning PR I have done it with various other combined features for a 20k sat channel) and you will see that the cost of this feature hardly triggers for small amounts anyway, due to its convex nature. So why would one choose arbitrary case distinctions of not using this feature for low amounts if the feature itself already encodes such behavior?
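For illustration, a minimal Go sketch (my own, not lnd code) of that uncertainty cost under a uniform liquidity prior:

```go
package main

import (
	"fmt"
	"math"
)

// uncertaintyCost returns the negative log of the a-priori success
// probability P(success) = (cap + 1 - amt) / (cap + 1) for sending amt
// through a channel of the given capacity, assuming a uniform liquidity
// prior as in the formula above.
func uncertaintyCost(capacity, amt uint64) float64 {
	if amt > capacity {
		return math.Inf(1) // the amount can never fit
	}
	p := float64(capacity+1-amt) / float64(capacity+1)
	return -math.Log(p)
}

func main() {
	// For a 10M sat channel the cost stays near zero for small amounts
	// and only rises steeply as amt approaches the capacity (convexity).
	for _, amt := range []uint64{1_000, 100_000, 1_000_000, 5_000_000, 9_000_000} {
		fmt.Printf("amt=%7d  cost=%.4f\n", amt, uncertaintyCost(10_000_000, amt))
	}
}
```

For a 10M sat channel the printed costs stay close to zero until the amount becomes a sizable fraction of the capacity, which is the convexity argument above.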

Is there a centralization effect on demanding excess capital for a route?

Yes, but that is not a consequence of the features chosen in selfish routing (cf. the Selfish Routing and the Price of Anarchy work by Tim Roughgarden, https://www.amazon.de/Selfish-Routing-Price-Anarchy-Press/dp/0262182432) but a fundamental property of the lightning network itself: liquidity is necessary to guarantee certain service level agreements / objectives.

People who wish to send money will be selfish in making routing decisions, and routing nodes / liquidity providers have to change the channel graph in adaptation to the selfish behavior of users. It's the same as street planning: it needs resources so that folks who choose the fastest / quickest route can all do so without congestion. Just look at Braess's paradox (https://en.wikipedia.org/wiki/Braess%27s_paradox), which is actually an answer to your concern, as it indicates that people won't want to open too-large channels if we choose this feature.

Anyway, I am getting nerd-sniped here.

I think you might be interested in concrete formulas for how to update your uncertainty / probabilities in case of failed / successful payment attempts. I have summarized the two papers in a comment on another PR: lightningdevkit/rust-lightning#1170 (comment)

LDK (rust-lightning) has not only implemented probabilistic pathfinding but also the proper maintenance of the uncertainty network with Bayesian updating in this PR: lightningdevkit/rust-lightning#1227. It might be easier to just look at how they did it than to read the paper. Note they have two arbitrary choices in their code (a rough sketch of the first, the half-life decay, follows after the list):

  1. The half-life of 3600 seconds for Bayesian updates
  2. The way they combine negative log probabilities with other features
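As a sketch of what such a half-life decay could look like in Go (the function and its names are illustrative, not LDK's or lnd's actual API):

```go
package sketch

import (
	"math"
	"time"
)

// decayedBounds relaxes the learned (minLiquidity, maxLiquidity) bounds of
// a channel back towards the fully uncertain interval [0, capacity] as time
// passes. After one half-life, half of the learned information is forgotten.
func decayedBounds(minLiq, maxLiq, capacity float64, elapsed time.Duration,
	halfLife time.Duration) (float64, float64) {

	factor := math.Pow(0.5, elapsed.Seconds()/halfLife.Seconds())

	newMin := minLiq * factor
	newMax := capacity - (capacity-maxLiq)*factor

	return newMin, newMax
}
```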

Thanks for all the links/notes Rene! Based on current timing/velocity, I'm not sure we'll be able to properly implement, review, and test (prob the most important thing!) this change for the 0.15 release. As a result, I'm moving this to the 0.16 tracker. On the up side, pushing this back a few more months gives us more time to learn from what others are doing, and also gives us more time to draft up something that can in theory kill several birds with one stone.

That sounds reasonable to me. @Roasbeef could you take care of #5746 a bit earlier, though? This way one could use lnd with external pathfinding solutions to gather first insights. Currently, this isn't possible as described in the issue. I'd be very happy to provide a usable implementation as part of lnd-manageJ rather soon, which depends on the issue being fixed before it can be used for real MPPs.

from @Roasbeef

and test (prob the most important thing!)

I fully agree, though I am very confident that you will end up doing an adaptation / extension of what I propose here: renepickhardt/mpp-splitter#11, or the notebook directly on the branch at: https://github.com/renepickhardt/mpp-splitter/blob/pickhardt-payments-simulation-dev/Pickhardt-Payments-Simulation.ipynb

This is, engineering challenges aside, the gist of how I believe payments should be delivered over lightning.

In any case the notebook, as well as the stuff that @C-Otto works on, may yield great chances or a basis for testing. In my case you could just overload the OracleLightningNetwork class (and make the pay loop async, which happens natively in Go I guess) and you can test against mainnet. While I start with complete uncertainty, the code of @C-Otto pulls info from mission control AND (!) also collects necessary data that mission control currently does not even store (as far as I understand). Success on a channel, from a probabilistic point of view, needs to be handled differently depending on the entire onion settling or failing downstream, as the information about liquidity is to be interpreted differently.

With respect to overall node quality in mission control, you could also apply a similar rubric to how failures impact node quality. If a node fails with 1 sat less than the maximum possible on a channel, that should be a minimal mission control penalty, since the attempt was logically likely to fail. If it fails at a midline value, that is the full penalty. This would help incentivize nodes to try to keep their channels more reliable for bidirectional traffic, and then help pathfinding recognize such nodes to avoid spending time on nodes that are not bidirectional.
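One possible way to express such an amount-dependent penalty weight in Go, as a simple linear variant of the rubric above (an illustrative sketch, not lnd's mission control code, using the a-priori success probability as the weight):

```go
package sketch

// failurePenaltyWeight scales the mission control penalty of a failed HTLC
// by how surprising the failure was: the closer the failed amount was to
// the channel capacity, the more expected the failure and the smaller the
// penalty. One possible weight is simply the a-priori success probability
// of the attempt under a uniform liquidity prior. Illustrative only.
func failurePenaltyWeight(failedAmt, capacity uint64) float64 {
	if failedAmt >= capacity {
		// Failing at the full capacity carries almost no information.
		return 0
	}
	return float64(capacity-failedAmt) / float64(capacity+1)
}
```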

Encouraging bi-directionality is good for decentralization: if the odds of guessing the liquidity distribution correctly at each hop are poor, the rapidly compounding failure chance produces a very steep drop-off in the number of channels you'd want to include in a route, at least for the type of payer who prioritizes reliability far over fees.

Then, from the routing node's overall quality perspective, you have a goal which is weighted towards fulfilling midline-sized HTLCs with the highest quality: try to keep your channels reasonably balanced with respect to where coins are moving, to avoid a superfluous requirement to keep your coins on a side where there is no traffic. This also allows private shadow channels to improve your capitalization privacy: the expectation of capitalization is set with the public channel pointer, and behind it you can stack private channels to increase throughput.

For Neutrino nodes, taking capacity into account may also be an issue because Neutrino isn't pulling the capacity of the channels; it's looking at the max HTLC size. You wouldn't want to use max HTLC size as a substitute for capacity there, because that size is mutable and is used to help set expectations itself.

try to keep your channels reasonably balanced with respect to where coins are moving, to avoid a superfluous requirement to keep your coins on a side where there is no traffic.

Even if all payment pairs were equally distributed, the fact that fees are not (and are taken as a feature in (selfish) routing) usually puts a direction of experienced flow on channels. (I believe this is often called drain.) While I guess this can also be seen on chain from channel closes and openings, I just created a simple model showing what failure rates we can expect on a channel because it is being depleted (https://twitter.com/renepickhardt/status/1526575556032180226), depending on the drain / flow of coins that may come from selfish routing. I would not be surprised if @bitromortac has thoughts about this.

[Screenshot: modeled channel failure rates as a function of drain]

What I am trying to say: of course there is a failure rate to be expected that is independent of the uncertainty of liquidity in the channel and more dependent on the drain and the selfish routing of sending nodes.

Motivated by @renepickhardt's work, I took a look at mission control data to check the assumption of a uniform balance distribution in the LN. I computed node-pair-based success probabilities, grouping node pairs into capacity ranges and averaging over successes and failures to route ~500000 sat. This results in the following plot:

[Plot: measured pairwise success probabilities per channel capacity range, with uniform and bimodal model curves]

Comparing the data (dots) with estimates from uniform and bimodal models (curves), we see that, at least for this amount, the measured points are not in line with a uniform distribution, where in the limit of large channels the success probability should approach unity. (There may be unknown biases here.)

As a second step, I assumed a bimodal distribution following a liquidity distribution X of P(x) ~ exp(-x/s) + exp((x-c)/s) ("bimodal constant model"), where s is a liquidity broadening scale and c is the channel capacity. This distribution describes situations in which channel liquidity is typically found on the local or the remote side, but rather not in between (unbalanced channels). It seems like this model is able to better describe the observed data (there is a model variation "bimodal variable", which works even better, but I won't describe it here as it's similar).
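For reference, integrating that density over [0, c] gives the a-priori success probability below; a small Go sketch of my own derivation (not code from the comment):

```go
package sketch

import "math"

// bimodalSuccessProb returns the a-priori probability that a channel of
// capacity c can forward amount a, assuming the bimodal liquidity density
// P(x) ~ exp(-x/s) + exp((x-c)/s) on [0, c] described above. The closed
// form is my own integration of that density; illustrative only.
func bimodalSuccessProb(a, c, s float64) float64 {
	if a > c {
		return 0
	}
	norm := 2 * (1 - math.Exp(-c/s))
	return (math.Exp(-a/s) - math.Exp(-c/s) + 1 - math.Exp(-(c-a)/s)) / norm
}
```

In the limit of capacities much larger than both s and the amount, this tends to (exp(-a/s) + 1)/2, which is consistent with the 1/2 limit for large amounts described below.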

[Plot: bimodal prior liquidity distribution]

Following Rene's method, I derived prior and posterior success probabilities for the bimodal distribution.

Comparison of the bimodal and uniform models:

The characteristic behavior from the bimodal model (s=400000 sat) can be seen here:

[Plot: success probability vs. channel capacity under the bimodal model]

The graph displays the estimated success probability under a bimodal prior distribution for different channel capacities. Curves for different amounts a are shown, where the red color code means sending an amount at the full capacity limit, i.e., no channel is able to forward the amount (P=0), because capacities do not permit that. The other amounts are chosen with a log spacing, where the amount value can be read off the intersection of the curve with the x axis. We see that, once the channel capacity is larger than the amount, the bimodal model's success probability is largely independent of the capacity. For large amounts, we have a success probability of 1/2 as expected (the channel either forwards or not). For smaller amounts the success probability generally increases, as more and more channels will have residual liquidity to forward due to the parameter "s" in the model.
The strategy in such an environment would be to decrease amounts to get to high success probabilities.

The equivalent plot for the uniform model is shown here:

[Plot: success probability vs. channel capacity under the uniform model]

The uniform model has the property that if one wants to send any amount, it is the best strategy to select the largest channel one can find to maximize success probability (statistically speaking).

For brevity I won't show how the posterior success probabilities behave, but I would like to contrast the two models:

Uniform:

Disadvantages:

  • tends to select large channels, thus may lead to congestion and centralization

Advantages:

  • its negative log probabilities are convex, so it can be used to calculate min-cost flows
  • is more "caring", as it leaves some balance over for the next payment, and could have a balancing effect

Bimodal:

Disadvantages:

  • is more greedy: it exhausts channels (but leaves behind amounts on the scale of s) once it learns that liquidity is present (a user might act selfishly in the uniform model as well, so this is not a strong argument)
  • as a pure bimodal model, it is not compatible with min-cost flow computation, as the negative log probability is not convex

Advantages:

  • reflects current network conditions
  • is very generic and could be more flexible to describe changing network conditions: we can use the bimodal model to interpolate to the uniform model (by increasing s), reintroducing convexity
  • is in line with current LND pathfinding philosophy (constant a-priori probability, high probability after success)

Both models enable us to make smarter decisions based on prior events like successful/failed/in-flight payments.

I suggest implementing the bimodal-model-based probabilities because:

  • it is more in line with the pathfinding system as currently implemented in LND, thus being a less disruptive change
  • it distributes HTLCs across different channel size ranges and doesn't focus on the "highways"
  • the bimodal model is tunable between a sharp bimodal and a uniform distribution; therefore it is able to incorporate the uniform model's advantages and can be adapted to network changes in the future

Hope the comparison was fair and am happy to gather some feedback from you.

Interesting observations and analysis. I guess you could always question the conditions under which mission control was populated. Is it based on a representative set of payments, is this what the average user is also going to encounter? But that seems to be a question that's almost impossible to answer given the dynamic nature of the network and variation in positioning of senders and receivers.

Maybe an additional way to validate could be to ask node operators to share a snapshot of channel balances, possibly pre-aggregated into a histogram to minimize the disclosure of information. Or make it even simpler by offering a small tool that does this. For the goal of making payments more reliable, I'd think that there must be operators willing to do this?

Assuming the described distribution is indeed a good match, what parts of lnd's probability estimation would need to change? I suppose the apriori probability is the main one?

Besides that, there is probably going to be some interaction with the way we factor in failures and scores of other channels of the same node too?

Interesting observations and analysis. I guess you could always question the conditions under which mission control was populated. Is it based on a representative set of payments, is this what the average user is also going to encounter? But that seems to be a question that's almost impossible to answer given the dynamic nature of the network and variation in positioning of senders and receivers.

Thank you, right, I think it will give us a chance to set up something smarter, but it makes sense to do it in a conservative way.

Assuming the described distribution is indeed a good match, what parts of lnd's probability estimation would need to change? I suppose the apriori probability is the main one?

Yes, I see most changes around getPairProbability; possibly it also interacts a bit with the node probability.

In this thread, we've been talking quite a bit about channel capacity as an input for probability estimation. It's also the title of this issue. But what will happen if channel capacities are no longer available, for example because of a new on-chain-privacy-preserving gossip system?

In a world where a single max_htlc is the only advertised property of a connection between two nodes, is it even possible to estimate an a priori probability that is more than a fixed value?

It may also be worth thinking about what users would set max_htlc to if that's the only thing that can be communicated. Wouldn't the rational decision be to set it to the channel capacity, so as not to risk missing a forward that could be forwarded under ideal conditions?

People do set the max HTLC to a smaller amount than the capacity in order to communicate a sustainable amount that they can forward; many node operators don't want people to try routing through their channel and fail. Economically speaking, this can be represented as not wanting a mission control penalty for a future route.

@bitromortac and I discussed this a bit further offline. Perhaps a simple improvement to lnd pathfinding could be to:

  • Increase a priori probability for channels that have an explicitly set max_htlc. This is an incentive to set it to something.
  • Penalize temporary channel failures for channels with max_htlc harder (longer half life time). This is an incentive to set max_htlc to something that can be delivered.

Updating max_htlc all the time would create a lot of gossip.
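A minimal Go sketch of the two knobs suggested above (names and structure are illustrative, not lnd's actual configuration):

```go
package sketch

import "time"

// maxHTLCParams is a hypothetical sketch of the two knobs suggested above:
// a higher a-priori probability and a longer penalty half-life for channels
// that advertise an explicit max_htlc. Names and fields are illustrative.
type maxHTLCParams struct {
	aprioriBoost    float64       // multiplier on the a-priori probability
	penaltyHalfLife time.Duration // default penalty half-life
	maxHTLCHalfLife time.Duration // longer half-life if max_htlc is set
}

// pairParams returns the a-priori probability and penalty half-life to use
// for a channel, depending on whether it advertises an explicit max_htlc.
func (p *maxHTLCParams) pairParams(apriori float64,
	hasMaxHTLC bool) (float64, time.Duration) {

	if !hasMaxHTLC {
		return apriori, p.penaltyHalfLife
	}

	boosted := apriori * p.aprioriBoost
	if boosted > 1 {
		boosted = 1
	}
	return boosted, p.maxHTLCHalfLife
}
```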

Some random thoughts:

Motivations behind max_htlc?

  • Routing node operators expect the route not to be tried if an amount larger than this amount is to be sent (thus not generating failures).
  • What would happen after a forward has happened? Would node operators adjust max_htlc? Looks like that doesn't scale, as C-Otto mentioned. Fees are a better way to do this, because they are able to enforce a velocity, an expected amount per time.
  • If it doesn't scale, then the operator would set it to something that is smaller than what he can route, to accommodate multiple routing events. In this case capital is not used efficiently.

We would also have two scenarios (a small sketch combining them follows after the list):

  • capacity and max_htlc are known: max_htlc could be used as the success_amount and/or fail_amount
  • only max_htlc is known: max_htlc could be used as the capacity
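Combining the two cases, a tiny Go sketch (hypothetical helper, not lnd code):

```go
package sketch

// effectiveValues is a hypothetical helper illustrating the two scenarios
// above: if the capacity is known, max_htlc could seed the success amount;
// if only max_htlc is known, it could stand in for the capacity itself.
func effectiveValues(capacity, maxHTLC uint64,
	capacityKnown bool) (effCap, successAmt uint64) {

	if capacityKnown {
		// Capacity and max_htlc known: use max_htlc as the success_amount.
		return capacity, maxHTLC
	}
	// Only max_htlc known: treat it as the capacity.
	return maxHTLC, maxHTLC
}
```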

I'm not perfectly sure how we should interpret max_htlc (does it denote the amount the node can route for sure?), will do a bit of gossip research.

If it doesn't scale, then the operator would set it to something that is smaller than what he can route, to accommodate multiple routing events. In this case capital is not used efficiently.

Maybe this is the cost of a fast payment network? I believe there will be a point in the future where even a single failure is no longer tolerated and results in a hard ban. It is already difficult enough to compete on speed with a conventional card payment with just a single payment attempt.

I agree with the fees direction @bitromortac mentioned and agree that setting this number in a very fine tuned way is not scalable and also has negative privacy implications

For max_htlc I could also see it relating to the capacity question by setting the default to the amount that a very big HTLC could route through the channel with high probability. Like 99.9% of capacity = low chance of success, 80% of capacity = reasonable chance of success, or whatever the model shows. This would be essentially free from a gossip point of view.

Is there a centralization effect on demanding excess capital for a route?

Yes, but that is not a consequence of the features chosen in selfish routing (cf. the Selfish Routing and the Price of Anarchy work by Tim Roughgarden, https://www.amazon.de/Selfish-Routing-Price-Anarchy-Press/dp/0262182432) but a fundamental property of the lightning network itself: liquidity is necessary to guarantee certain service level agreements / objectives.

People who wish to send money will be selfish in making routing decisions, and routing nodes / liquidity providers have to change the channel graph in adaptation to the selfish behavior of users. It's the same as street planning: it needs resources so that folks who choose the fastest / quickest route can all do so without congestion. Just look at Braess's paradox (https://en.wikipedia.org/wiki/Braess%27s_paradox), which is actually an answer to your concern, as it indicates that people won't want to open too-large channels if we choose this feature.

Anyway, I am getting nerd-sniped here.

Don't you think there are not enough economic incentives for large masses to conduct routing, and hence we might be facing this issue, which could be temporary? Anyway, I am just a newbie learning; kindly excuse me if the question is not valid.

First of all, kudos to @bitromortac for digging into challenging the uniform prior assumption. I unfortunately don't have enough time right now to address all of the points that you are raising (as promised out of band, I will come back to this), but I think I can explain why we do not see a uniform distribution of liquidity, as indicated on the mailing list and in this blog article.

TL;DR: the distribution that we currently see mainly emerges from the flow of payments (who wants to pay whom) and the optimization criteria that nodes choose in pathfinding (e.g. optimizing for fees). I would be very careful drawing conclusions from any of those results. Because:

  1. it does not take into account how the distribution changes if we adapt the optimization goals
  2. it does not take into account that the actual selfish behavior may be directed against other nodes.

With respect to the 2nd point: as of right now - where literally everyone optimizes for fees - I can write down a very easy and probably excellently working pathfinding strategy, which reads:

"Compute expected drains on channels only take edges against the direction of drain and optimize for fees".

I am very sure that, as of now, this pathfinding strategy will work amazingly until some software implements it and people start using it widely, which would change the prior belief about the drain... (Long story short: I think we need to figure out the full game theory, dominant strategies and Nash equilibria instead of having ad-hoc solutions, like the ones that we and everyone else have been proposing so far.) Also, the bigger concern will be how we can design the routing game so that the price of anarchy of the dominant strategies will be low. I hinted at this in the liquidity ads proposal.

@bitromortac wrote:

As a second step, I assumed a bimodal distribution following a liquidity distribution X of P(x) ~ exp(-x/s) + exp((x-c)/s) ("bimodal constant model"), where s is a liquidity broadening scale and c is the channel capacity. This distribution describes situations in which channel liquidity is typically found on the local or the remote side, but rather not in between (unbalanced channels). It seems like this model is able to better describe the observed data (there is a model variation "bimodal variable", which works even better, but I won't describe it here as it's similar).

This is actually the reason why I am giving such a short and quick reply: while I was trying to model the random process a bit more specifically in this notebook, I realized that the bimodal assumption seems pretty weird. As soon as we have drain, the liquidity will be on one end. Is there a way of modelling this by removing one mode and starting from a random drain, which would put the liquidity for a specific channel either on one side or the other? So making P(x) ~ exp(-x/s) or P(x) ~ exp((x-c)/s), where your choice of s probably depends on the strength of the drain, and the choice of the first or second distribution on its direction. I would be very surprised if such a simpler (non-bimodal) model could not predict the data from mission control.
Actually I think we could even go one step further: for each direction of the channel one could choose P(x) ~ exp(-x/s), which would make the information content - that I called uncertainty cost or negative log probabilities - (-log(P(x)) = x/s, up to a constant) convex again.

Specifically, it is very similar to the linearized uncertainty unit cost that I used, which was defined as 1/c, where c was the capacity of the channel. In your case, without the necessity to linearize things, I guessed above that 1/s would be related to the prior assumed drain of the channel. While the drain of the channel might not correlate with the size, I can imagine the similarities of the outcome would explain why @C-Otto was successfully sending 0.6 Bitcoin to my node using the probabilistic model based on channel capacities.
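Read as a success probability, the single-sided exponential model gives something like the Go sketch below (my own normalization over [0, c], not from the comment):

```go
package sketch

import "math"

// expDrainSuccessProb returns the a-priori success probability for amount a
// on a channel of capacity c, assuming the single-sided exponential
// liquidity density P(x) ~ exp(-x/s) on [0, c] suggested above (liquidity
// pushed to one side by drain). Illustrative sketch only.
func expDrainSuccessProb(a, c, s float64) float64 {
	if a > c {
		return 0
	}
	return (math.Exp(-a/s) - math.Exp(-c/s)) / (1 - math.Exp(-c/s))
}
```

For capacities much larger than s, the negative log of this probability is approximately a/s, i.e. linear in the amount and hence convex, matching the point above about regaining convexity.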

Re-opening this as #6815 is still lingering (relevant so we can do proper head to head comparisons).