waku-org/nwaku

chore(rln): move from epoch based gap to timestamp based.

Opened this issue ยท 9 comments

Background

We currently validate that an RLN message epoch is +- a given gap (with respect to the current one) determined in epochs, see. This is good since it leaves some margins for clock jitter among nodes, propagation/processing times, etc.

However, when having long epoch sizes (eg several minutes or hours), an old message could be replayed into the future to spam the network. And since the gossipsub window just covers 2 minutes, the network will see this message as new.

To solve this issue, its proposed to:

  • Make timestamp mandatory for rln messages
  • Include the timestamp in the RLN signal, here
  • Use this timestamp as the allowed gap. Eg, +-10 seconds.

Makes sense to me. Thanks for creating this.

Commented elsewhere, but I think we may have to validate that the timestamp matches the epoch against which the proof was generated too. Since the epoch is still the unit against which a rate limit is measured, an attacker could presumably use very old/future epochs to generate proofs and just ensure that they use current timestamps if they want to spam the network?

Some things to take into account for determining the timestamp diff we allow:

  • Propagation time of the message. In a very large network this can get quite high. Eg if a node generates the message + timestamp at t0 and the messages hops 5 times, the receiver will see a diff of 5*prop_time. Ofc this should be < than the gap we allow.
  • Gossipsub window size (2min now). The gap can't be higher than that, otherwise we won't be protected by the gossipsub cache window.

cc @s-tikhomirov

the gossipsub window just covers 2 minutes

Is the gossipsub window you mention the same as seen_ttl defined here? The value of 2 minutes is mentioned there as a "reasonable default", not a strict requirement. I'd say we should then use a non-default seen_ttl value for our use case.

The RLN semantics require RLN Relay nodes to keep track of all messages at least within the current epoch to detect spam. If the gossipsub cache is too short for this purpose, isn't the simplest solution just make it longer, for example, equal to to the epoch length (10 mintes)? Can we do it? What are the downsides?

The RLN semantics require RLN Relay nodes to keep track of all messages at least within the current epoch to detect spam. If the gossipsub cache is too short for this purpose, isn't the simplest solution just make it longer, for example, equal to to the epoch length (10 mintes)? Can we do it? What are the downsides?

Main downside is that your cache (in memory) will be bigger, leading to a higher memory footprint. And well, if someone deploys a network with an epoch of 1 day, then we need seen_ttl of 1 day? Said footprint ofc depends on the amount of messages being sent over that period. The advantage of that solution is that we don't need to modify the code, just tune a config parameter. One thing that is indeed interesting with this approach is that timestamps don't need to be enforced, which adds some privacy? (eg mitigate timing attacks?)

but imho its not about modifying the seen_ttl (which can be done) but about setting a reasonable window size (eg 2 min) and then enforcing it. does it make sense to relay a message that was generated at t0 in t0+9minutes? not sure that offers something valuable, but it opens a bunch of attack vectors.

does it make sense to relay a message that was generated at t0 in t0+9minutes?

It depends on the use case. I don't think it's up for the protocol level to decide. I can imagine, for example, some commit-reveal scheme (a closed-bid auction or something similar) that time-stamps bids but reveals them 9 minutes later.

One thing that is indeed interesting with this approach is that timestamps don't need to be enforced, which adds some privacy? (eg mitigate timing attacks?)

Can we really enforce timestamps though? And what do you mean by "enforce"?

Generally speaking, I think we should define our security assumptions around timestamps. If we assume that most (maybe "most" requires a more precise definition) Relay nodes have their local clocks within, say, 1 second of the real physical time, then we can define the cache size in terms of physical seconds. And even if the attacker tries to manipulate with timestamps, its message will be dropped by most Relay nodes.

An analogy: in Bitcoin, timestamps partially determine block validity: a block timestamp must be higher than the median value of the past 11 blocks. Timestamps thus don't have to follow block progression (it might be that t_n > t_{n+1}), however, this rule prevents wild timestamp manipulations without adding too many trust assumptions.

We'd have to think carefully then about edge cases. Say, a message is generated at time t1 and committed to epoch e(t1). A given Relay node receives this message at time t2>t1 during epoch e(t2). Let g be the maximum allowed gap (seen_ttl). There are two independent message validity criteria now:

  • t2 - t1 > g?
  • e(t2) == e(t1)?

A Relay node's actions can be as follows:

abs(t2 - t1) < g abs(t2 - t1) > g Comment
e(t2) < e(t1) Keep in cache until e(t1) and then relay. Drop the message. The message is committed to a future epoch but relayed prematurely.
e(t2) == e(t1) Happy path: relayed on time within the right epoch. Relay. Relayed too early or too late within the right epoch. If early: keep in cache and relay at t1. If late: drop. The message is committed to the same epoch as when it is relayed.
e(t2) > e(t1) Relayed just a bit too late; perhaps in the last seconds of the correct epoch. Drop. Suggestion for users: don't send in the last seconds, better re-generate the proof for the next epoch. Relayed too late. Drop. The message is committed to a past epoch.

All in all, I agree that if we trust honest nodes to be in sync with "real" time anyway, then we can also define timestamp-based validity conditions on messages.

A Relay node's actions can be as follows:

I'm not sure that the complexity here makes sense. Since epoch sizes can vary wildly (think epoch of a day or longer), it doesn't make sense to include epoch comparison to determine if a message has been sent roughly within the approximately real-time environment of Relay. For example, if we cache messages in the same epoch but early timestamp, it will be easy to attack this cache by simply generating millions of messages with early timestamp but valid epoch.

In general, the epoch is useful only as a rate-limiting window, not as real-time check. The timestamp on its own can be used to introduce some TTL-like behaviour for messages, necessary to protect against past/future replay injection attacks. We are (and should be) very generous here. Previously we allowed up to 20 second timestamp variance in either direction. On the assumption that Relay nodes and publishers have to function roughly within this (wide) real-time window, it becomes easy to simply drop the message if abs(t2 - t1) > g with g == 20.

Can we really enforce timestamps though? And what do you mean by "enforce"?

Yep. no timestamp = reject message + lower score of the sender. timestamp_diff > eg 20 seconds, reject message and lower score of the sender. since the idea is to have the timestamp to be included in the signal (part of the zk proof input) that should cover it.

it doesn't make sense to include epoch comparison to determine if a message has been sent roughly within the approximately real-time environment of Relay

But it is still required to compare message epoch with the Relay's node current epoch to detect rate limit violatoins, right?

if we cache messages in the same epoch but early timestamp, it will be easy to attack this cache by simply generating millions of messages with early timestamp but valid epoch

I'm not sure I understand this.

Sorry if I get this discussion side-tracked, but I really want to get to the bottom of this :) From what I understand from the spec:

  1. epoch is derived deterministically from a UNIX timestamp;
  2. message validity depends on it being relayed in the "current" epoch;
  3. to detect rate limiting violation, a Relay node must keep a cache of messages relayed within the "current" epoch.

A hidden assumption in the RLN spec as it stands now it that it uses "current epoch" as a single term, whereas in reality there is a difference between the "current" epoch as defined by:

  • the RLN proof a message is committed to;
  • the time when the sender first relays the message;
  • the time when a given Relay node first sees this message.

In summary (please let me know if any of the following is based on a misunderstanding):

  1. I agree that an additional timestamp-based validity condition might be helpful to prevent deliberate timestamp manipulation.
  2. RLN validity is still defined in terms of epochs, therefore, a Relay node must at least be aware of all messages trom the latest epoch to detect spam attempts;
  3. If we add timestamp-based validity conditions, we should address edge cases at epoch boundaries (or accept them as inevitable);
  4. Long-term, it'd be great to have our assumptions around clock synchronization defined precisely as it relates to RLN proof validation.

it is still required to compare message epoch with the Relay's node current epoch to detect rate limit violatoins, right?

I don't think so. Afaiu rate limit violations are checked by caching the nullifiers generated within an epoch and ensuring that no double signalling occur. Of course, we should ensure that the caching occurs for all epochs where valid messages (with valid timestamps) could occur, but I think this is done in the implementation already. The point is that the epoch in a message does not have to be the current epoch for the message to be valid.

message validity depends on it being relayed in the "current" epoch;

I don't understand it this way. The message is generated with the "current" epoch as seen by the publisher. Currently, this does not have to match the current epoch of the validator, as long as there's no more than max_epoch_gap between the published and validating epoch. The proposal is to change to a timestamp-based validation. Indeed, in implementation we should ensure that nullifiers are cached for all epochs that could conceivably fall within the valid timestamp window, but the epoch itself is not a validation requirement.

a Relay node must keep a cache of messages relayed within the "current" epoch.

IIUC a cache of nullifiers within all epochs that could contain messages with valid timestamps

RLN validity is still defined in terms of epochs

Only the rate-limiting is defined per epoch.

a Relay node must at least be aware of all messages trom the latest epoch to detect spam attempts

It must be aware of all valid messages from whichever epochs can contain valid messages. Valid message is one with a timestamp within an acceptable window from current time as measured by the validator. You're absolutely right that nullifiers we cache might still cover several epochs to account for boundary conditions, but this is an implementation detail and not related to message validity rules