pubkey/rxdb

WebRTC Peer Gone issues

bastiankistner opened this issue · 12 comments

Hi Daniel,

I've been using v15 since beta 2x or so for an order processing system, primarily because of the promising WebRTC support. Until today I have had an issue that I could not reproduce or even analyse. I noticed you've also been working on the simple-peer connection handler recently, which I am using (slightly modified to get it to work with AWS API Gateway WebSockets).

Long story short: I often, but randomly, saw errors thrown while trying to send a message to a peer, after which the whole sync got stuck. The error originates here:

await (peer as any).send(JSON.stringify(message));

At first I thought it was my implementation of the signaling server, so I reimplemented it. I still had those issues. Then I tried your new signaling server example. It worked fine at first, until I refreshed the browser, which led to a new peerId being assigned while the old one remained in the room. This seems to cause the issue when sending messages, because all participants still believe that peers which have already gone are present. We also cannot handle this reliably on the server side: because of connectivity issues or lost connections, it is never certain that the server's room state reflects the actual situation.

This case, where a peer isn't available anymore, doesn't seem to be handled yet. When RxDB tries to send a message, it gets stuck until I refresh the browser that was changing the data. Shortly before the page refreshes, the updates bubble through to the available peers.

So I tried the following and it seems to work fine:

if (peer.writable) {
  await (peer as any).send(JSON.stringify(message));
} else {
  error$.next(
    newRxError('RC_WEBRTC_PEER', {
      error: new Error('Peer gone'),
    }),
  );
}

I can also open a PR, but I first wanted to let you know and maybe see if this is the right way to handle it.

Thanks a lot! Really looking forward to being able to enjoy a reliable sync and hopefully upgrade to premium when my customers are happy with it.

Just experienced the same issue during signaling. Is error$.next(...) a valid way to inform RxDB about peers that have left? I'll take a deeper look tomorrow into how it handles such issues.

And I'm seeing another issue. When I start the sync, it works fine for the first 120 items, but after that the peer closes. It might also be related to simple-peer, though. I'll find out more.

Have you thoroughly tested WebRTC replication yourself yet? I could also provide access to the application I'm building.

This is what happens after the first 120 items:

CLIENT(PSWKPeoYDoECHIQ=) peer got error:
RTCError: User-Initiated Abort, reason=Close called

code: "ERR_DATA_CHANNEL"
errorDetail: "sctp-failure"
httpRequestStatusCode: null
message: "User-Initiated Abort, reason=Close called"
name: "OperationError"
receivedAlert: null
sctpCauseCode: 12
sdpLineNumber: null
sentAlert: null

Nice that you found my little signaling server :)

I worked on the signaling server over the last few days. It had several problems, such as not cleaning up disconnected peers. This should be more stable now. My final goal is to harden it as much as I can so that it can run "forever".
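To illustrate what cleaning up disconnected peers can look like on a signaling server, here is a minimal sketch. It is not the code of the actual RxDB signaling server; `Room`, `touch`, `prune` and the timeout value are hypothetical names chosen for this example, and it assumes that every peer sends a periodic ping.

```typescript
// Hypothetical room registry; none of these names come from the RxDB signaling server.
type PeerId = string;

class Room {
    // Maps each peer to the timestamp (ms) at which it was last heard from.
    private lastSeen = new Map<PeerId, number>();
    private timeoutMs: number;

    constructor(timeoutMs: number) {
        this.timeoutMs = timeoutMs;
    }

    // Call on join and on every ping received from the peer.
    touch(peerId: PeerId, now: number = Date.now()): void {
        this.lastSeen.set(peerId, now);
    }

    // Drops peers that have been silent for longer than timeoutMs
    // and returns the ids that were removed.
    prune(now: number = Date.now()): PeerId[] {
        const removed: PeerId[] = [];
        this.lastSeen.forEach((seenAt, peerId) => {
            if (now - seenAt > this.timeoutMs) {
                this.lastSeen.delete(peerId);
                removed.push(peerId);
            }
        });
        return removed;
    }

    // Peers currently considered to be in the room.
    peers(): PeerId[] {
        return Array.from(this.lastSeen.keys());
    }
}
```

A server could call `prune()` on a timer and broadcast the removed ids to the remaining peers. Because pings can be delayed or lost, the room state is still only a best guess, so clients need to tolerate stale entries as well.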

There is no real way to determine whether a peer has disconnected. So it should just catch message errors and reconnect the whole connection. This is not implemented yet; a PR is welcome. EDIT: Working on a fix.
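As a rough sketch of the "catch message errors" idea (this is not the fix that was later released): the helper below treats any exception from `send` as a sign that the peer is gone, destroys it, and tells the caller so that a reconnect can be started. `PeerLike`, `sendOrDisconnect` and `onDisconnect` are hypothetical names; `PeerLike` only mirrors the `send` and `destroy` methods of a simple-peer instance.

```typescript
// Minimal stand-in for the part of a simple-peer instance used here (assumption).
interface PeerLike {
    send(data: string): void;
    destroy(): void;
}

// Sends a message to the peer. If sending throws, the peer is destroyed and
// the caller is notified so that it can re-establish the connection.
// Returns true if the send succeeded, false otherwise.
function sendOrDisconnect(
    peer: PeerLike,
    message: unknown,
    onDisconnect: (peer: PeerLike, error: unknown) => void
): boolean {
    try {
        peer.send(JSON.stringify(message));
        return true;
    } catch (error) {
        peer.destroy();
        onDisconnect(peer, error);
        return false;
    }
}
```

In a connection handler, `onDisconnect` could forward the peer to whatever stream the handler uses to report lost peers, so that the replication pool drops it instead of waiting on it.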

Released 15.0.0-beta.42 with a fix that reconnects broken peers. Please test.

I had less than an hour today to test it. It works fine in the beginning, but then I see replication issues on one of three clients. I deleted the IndexedDB in all browser tabs (two Chrome profiles and one additional machine with another Chrome profile). Then I switched accounts and it happened again, this time on my second machine. I usually synchronize ~200 items, which works great. But when I then modify a few of them across accounts, the last few of about 20 updates suddenly end up in different states on different clients. If I refresh the browser, it usually continues to work, but not fully reliably, and the items that diverged before are never re-synchronized.

I could also get an error being thrown here

const peerState = getFromMapOrThrow(this.peerStates$.getValue(), peer);

I’m sorry I don’t have more details yet and can’t provide a PR. My next goal is to debug this part. I would assume that the thrown error stops the replication process, but I have neither worked in depth with rxjs nor had the time to fully understand the rxdb sources.

The WebRTC replication doesn’t feel very mature. Do you know anyone who’s using it in production? I think the idea is fantastic, but I’d really need to rely on it otherwise I’ll have to drop it 🥺

Sorry, I accidentally closed this.
WebRTC replication is in beta atm. The goal was to move it to non-beta in the new RxDB v15 release.
It would be great to know if there is a problem with the webrtc connection or with the replication protocol itself (which would be bad).
Are you using a custom conflict handler?

I was using a custom conflict handler based on a timestamp. But I moved to using the CRDT plugin. If you have any questions, let me know! I'll be able to look into it tonight and provide you feedback.

"I could also get an error being thrown here"

I need the whole stack trace for that.

It seems to me that this is a problem with the simple peer library.
Maybe you can reproduce the exact behavior with the quickstart repo.

Are you on the newest chrome version?

Following is a stacktrace when a peer was missing.

However, I'm using a slightly modified version of the getConnectionHandler.ts, which I need to be able to run the signaling server on AWS API Gateway. Reasons are:

  • I need to add search params to the url, which contain a token to be able to connect through a lambda authorizer. I added this as headers to the SimplePeerConnectionHandlerOptions (should probably rename that) and I have to re-generate the URL on every reconnect to ensure I'm establishing the connection with a fresh token
  • I cannot return anything during the connect phase, so I have to send an init message after the connection is established to start the whole connection flow
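The two points above could be sketched roughly as follows. This is not the attached getConnectionHandlerV2.ts; the search parameter name `token`, the `{ type: 'init' }` message shape, and the `SocketLike` stand-in are all assumptions made for this illustration.

```typescript
// Builds the signaling URL for one connection attempt. Call it again with a
// newly issued token on every reconnect so that the authorizer never sees a stale one.
function signalingUrlWithToken(baseUrl: string, token: string): string {
    const url = new URL(baseUrl);
    // 'token' is a hypothetical parameter name.
    url.searchParams.set('token', token);
    return url.toString();
}

// Minimal socket shape (assumption) so the sketch does not depend on a
// particular WebSocket implementation.
interface SocketLike {
    send(data: string): void;
    onopen: (() => void) | null;
}

// Opens a socket and sends an init message once the connection is established,
// for gateways that cannot return anything during the connect phase.
function openWithInit(
    url: string,
    createSocket: (url: string) => SocketLike
): SocketLike {
    const socket = createSocket(url);
    socket.onopen = () => {
        // Hypothetical message shape; the server defines what it expects.
        socket.send(JSON.stringify({ type: 'init' }));
    };
    return socket;
}
```

Keeping URL construction in its own function makes it easy to call from the reconnect path, which is where a cached URL with an expired token would otherwise be reused.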

I've added the file to this post as a zip. You don't need to look at it and you can also wait till I've tested it with your signaling server and the original getConnectionHandler. I'll try to set it up tomorrow and run some more tests!

Unhandled Runtime Error
Error: missing value from map [object Object]

Source
getFromMapOrThrow
../../node_modules/.pnpm/@github.com+pubkey+rxdb+archive+refs+tags@15.0.0-beta.42.tar.gz_rxjs@7.8.1/node_modules/rxdb/dist/esm/plugins/utils/utils-map.js (4:0)
RxWebRTCReplicationPool.removePeer
../../node_modules/.pnpm/@github.com+pubkey+rxdb+archive+refs+tags@15.0.0-beta.42.tar.gz_rxjs@7.8.1/node_modules/rxdb/dist/esm/plugins/replication-webrtc/index.js (174:37)
Object.eval [as next]
../../node_modules/.pnpm/@github.com+pubkey+rxdb+archive+refs+tags@15.0.0-beta.42.tar.gz_rxjs@7.8.1/node_modules/rxdb/dist/esm/plugins/replication-webrtc/index.js (37:0)
ConsumerObserver.next
../../node_modules/.pnpm/rxjs@7.8.1/node_modules/rxjs/dist/esm5/internal/Subscriber.js (96:0)
Subscriber._next
../../node_modules/.pnpm/rxjs@7.8.1/node_modules/rxjs/dist/esm5/internal/Subscriber.js (63:0)
Subscriber.next
../../node_modules/.pnpm/rxjs@7.8.1/node_modules/rxjs/dist/esm5/internal/Subscriber.js (34:0)
eval
../../node_modules/.pnpm/rxjs@7.8.1/node_modules/rxjs/dist/esm5/internal/Subject.js (41:0)
errorContext
../../node_modules/.pnpm/rxjs@7.8.1/node_modules/rxjs/dist/esm5/internal/util/errorContext.js (19:0)
Subject.next
../../node_modules/.pnpm/rxjs@7.8.1/node_modules/rxjs/dist/esm5/internal/Subject.js (31:20)
../packages/realtime/dist/rxdb-utils/getConnectionHandlerV2.js (149:52) @ next

  147 |     if (!disconnected) {
  148 |         disconnected = true;
> 149 |         disconnect$.next(newSimplePeer);
      |                    ^
  150 |     }
  151 | });
  152 | newSimplePeer.on('connect', () => {
Call Stack
Peer.emit
../../node_modules/.pnpm/next@14.0.3_@babel+core@7.23.3_react-dom@18.2.0_react@18.2.0_sass@1.69.5/node_modules/next/dist/compiled/events/events.js (1:2420)
eval
../../node_modules/.pnpm/simple-peer@9.11.1/node_modules/simple-peer/index.js (496:0)

getConnectionHandlerV2.ts.zip
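The trace above shows getFromMapOrThrow throwing inside RxWebRTCReplicationPool.removePeer when it is called from the disconnect$ subscription. One possible explanation, which is an assumption and not confirmed here, is that the peer had already been removed or was never added to the pool. A removal that tolerates unknown peers could look like the sketch below; `PeerState` and `removePeerIfKnown` are hypothetical stand-ins, not RxDB's actual types or API.

```typescript
// Local stand-in for the per-peer bookkeeping (assumption, not RxDB's actual type).
type PeerState = { cancel: () => void };

// Removes a peer if it is known and stops its work. Returns false instead of
// throwing when the peer is not in the map, so that a repeated disconnect
// notification for the same peer is harmless.
function removePeerIfKnown<P>(peerStates: Map<P, PeerState>, peer: P): boolean {
    const state = peerStates.get(peer);
    if (!state) {
        // Unknown or already removed peer: nothing to clean up.
        return false;
    }
    peerStates.delete(peer);
    state.cancel();
    return true;
}
```

Whether silently ignoring an unknown peer is the right behaviour depends on why it is missing; logging the case would help to tell a duplicate disconnect apart from a peer that was never registered.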

I am running Version 119.0.6045.199 (Official Build) (arm64) of Chrome. I just noticed there is an update; I'm downloading it and will test again.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed soon. If you still have a problem, make a PR with a test case or to prove that you have tried to fix the problem. Notice that only bugs in the rxdb premium plugins are ensured to be fixed by the maintainer. Everything else is expected to be fixed by the community, likely you must fix it by yourself.

Issues are autoclosed after some time. If you still have a problem, make a PR with a test case or to prove that you have tried to fix the problem.

I'm also experiencing a similar issue with a very basic setup (one collection, the wss://signaling.rxdb.info/ server, Chrome on macOS + Chrome on Android):

RxError (RC_WEBRTC_PEER): RxError (RC_WEBRTC_PEER):
    RxReplication WebRTC Peer has error
    Given parameters: {
      error:{
        "code": "ERR_CONNECTION_FAILURE"
      }}
    at newRxError (rx-error.js:98:10)
    at _Peer.<anonymous> (connection-handler-simple-peer.js:116:31)
    at zone-patch-rxjs.js:98:41
    at proto.<computed> (zone.js:962:24)
    at EventEmitter.emit (events.js:81:17)
    at index.js:496:21
    at _ZoneDelegate.invokeTask (zone.js:402:31)
    at _Zone.runTask (zone.js:173:47)
    at drainMicroTaskQueue (zone.js:581:35)
  ERROR Error: missing value from map [object Object]
    at getFromMapOrThrow (utils-map.js:4:11)
    at RxWebRTCReplicationPool2.removePeer (index.js:174:21)
    at Object.next (index.js:37:147)
    at ConsumerObserver2.next (Subscriber.js:96:33)
    at Subscriber2._next (Subscriber.js:63:26)
    at Subscriber2.next (Subscriber.js:34:18)
    at _ZoneDelegate.invoke (zone.js:368:26)
    at Object.onInvoke (core.mjs:14695:33)
    at _ZoneDelegate.invoke (zone.js:367:52)
    at _Zone.run (zone.js:129:43)