tomp2p/TomP2P

TomP2P Performance issues over time

Opened this issue · 4 comments

I've been running some small benchmarks on TomP2P and need reference numbers to compare against. I'm curious whether my results are similar to the original benchmarks of the system. Could you point me to any relevant material?

I've used TomP2P in a distributed storage network built for low latency and high reliability, with TomP2P as a third layer of storage (effectively a persistence layer), and I seem to have run into some performance issues that become apparent over time.

My test:

  • 40 peers spread across 3 hosts
    • Node 1: 16 peers
    • Node 2: 9 peers
    • Node 3: 15 peers
  • RPS: 10
  • Duration: 1800 sec
  • PUT commands only (on 8 selected storage nodes only)

Output from my Artillery test against a REST endpoint:

All virtual users finished
Summary report @ 21:54:17(+0200) 2021-05-03
  Scenarios launched:  18000
  Scenarios completed: 13342
  Requests completed:  13342
  Mean response/sec: 9.95
  Response time (msec):
    min: 2
    max: 9995
    median: 8
    p95: 5789
    p99: 9162.1
  Scenario counts:
    peer-2: 2286 (12.7%)
    peer-1: 2159 (11.994%)
    peer-12: 2306 (12.811%)
    peer-13: 2217 (12.317%)
    peer-18: 2249 (12.494%)
    peer-4: 2328 (12.933%)
    peer-17: 2174 (12.078%)
    peer-3: 2281 (12.672%)
  Codes:
    200: 1381
    400: 14
    408: 11947
  Errors:
    ETIMEDOUT: 4200
    ECONNRESET: 458
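The counts in the report above can be put on a common scale by dividing each outcome by the number of scenarios launched. The following is a minimal sketch in plain Java; the class and method names are hypothetical and the numbers are copied from the report. The five outcome counts (200, 400, 408, ETIMEDOUT, ECONNRESET) add up to the 18000 launched scenarios, so the shares sum to 100%.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Artillery or TomP2P.
class ArtillerySummary {

    /** Returns each outcome's share of the launched scenarios, in percent. */
    static Map<String, Double> shares(int launched, Map<String, Integer> outcomes) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : outcomes.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / launched);
        }
        return result;
    }

    public static void main(String[] args) {
        // Counts taken verbatim from the summary report.
        Map<String, Integer> outcomes = new LinkedHashMap<>();
        outcomes.put("HTTP 200", 1381);
        outcomes.put("HTTP 400", 14);
        outcomes.put("HTTP 408", 11947);
        outcomes.put("ETIMEDOUT", 4200);
        outcomes.put("ECONNRESET", 458);

        shares(18000, outcomes).forEach(
            (name, pct) -> System.out.printf("%-11s %6.2f%%%n", name, pct));
    }
}
```

Read this way, only 1381 of the 18000 launched scenarios (roughly 7.7%) ended in a 200 response.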

As you can see, I have a lot of timeouts and failed connections. As expected, I'm also seeing quite a few timeout warnings:

WARN  TimeoutFactory - Channel timeout for channel Sender [id: 0xbd747647, L:/0:0:0:0:0:0:0:0:57213].
WARN  TimeoutFactory - Request status is msgid=-1862733718,t=REQUEST_1,c=PING,tcp,s=paddr[0x975eb768918c948a5de0dc3cc419b424bd131363[/192.168.0.150,5491]]/relay(false)/slow(false),r=paddr[0x99830ba9278cafaa6a52bda03c1755733463c0de[/<my-ip, port>]]/relay(false)/slow(false)
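To check whether the timeouts cluster on particular peers, warnings like the one above can be tallied per recipient. The sketch below uses only `java.util.regex`; it assumes that `s=` and `r=` denote the sender and recipient of the request and that the first hex value inside `paddr[...]` is the peer ID. That reading is inferred from the log line itself, not taken from TomP2P documentation, and the class name is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper; the field meanings are an assumption (see above).
class TimeoutLogTally {

    // Group 1: command (e.g. PING). Group 2: recipient peer ID (hex).
    private static final Pattern REQUEST_STATUS = Pattern.compile(
        "c=(\\w+),\\w+,s=paddr\\[0x[0-9a-fA-F]+\\[.*?\\]\\].*?"
            + ",r=paddr\\[(0x[0-9a-fA-F]+)\\[");

    /** Counts "Request status" warnings, keyed by "COMMAND -> recipientId". */
    static Map<String, Integer> timeoutsPerRecipient(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = REQUEST_STATUS.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " -> " + m.group(2), 1, Integer::sum);
            }
            // Lines without a request status (e.g. "Channel timeout") are skipped.
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "WARN  TimeoutFactory - Request status is msgid=-1862733718,"
            + "t=REQUEST_1,c=PING,tcp,"
            + "s=paddr[0x975eb768918c948a5de0dc3cc419b424bd131363[/192.168.0.150,5491]]"
            + "/relay(false)/slow(false),"
            + "r=paddr[0x99830ba9278cafaa6a52bda03c1755733463c0de[/<my-ip, port>]]"
            + "/relay(false)/slow(false)";
        timeoutsPerRecipient(List.of(sample))
            .forEach((key, n) -> System.out.println(n + "  " + key));
    }
}
```

Feeding the whole log file through `timeoutsPerRecipient` (for example via `Files.readAllLines`) gives one count per command and recipient; a single peer ID dominating the tally would point at one misbehaving node rather than at the overlay as a whole.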

I'm using the latest stable version of TomP2P

I'd really appreciate any help/advice
Implementation reference - https://gitlab.com/iggydv12/nomad/-/blob/master/src/main/java/org/nomad/storage/overlay/TomP2POverlayStorage.java

Put example logs
overlay-put-test.log

@tbocek any advice? :)

After some digging I found that one of the nodes (the OS X node) was causing the performance issues and making many of the messages time out. Either my current setup is flawed on OS X for some reason, or it's a bug. I'm not able to say at this point.

One thing I've noticed is that as the ring grows, the share of successful requests goes down. It seems that, for some reason, the data gets re-distributed and is then no longer available to other nodes?

For bootstrapping I don't always use the same node - I'm not sure if that is an issue?

Version 14.0 (14.0)

I'm still having some issues with the reliability of the network. Puts succeed 100% of the time, but my read success rate is only around 50%, and it gets worse as the network scales. I guess some degradation is expected, but it persists even with direct replication enabled, and indirect replication does not work on the latest beta version :(
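One way to tell transient timeouts apart from data that is genuinely missing is to retry each read a bounded number of times with a growing delay and see whether later attempts succeed. The sketch below is generic Java and deliberately does not call TomP2P: the `read` supplier is a hypothetical stand-in for whatever wraps the actual get call, and the attempt count and delays are arbitrary.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical wrapper; "read" stands in for the real overlay lookup.
class RetryingRead {

    /**
     * Calls read up to maxAttempts times, doubling the pause between attempts.
     * Returns the first non-empty result, or Optional.empty() if every attempt
     * came back empty or the thread was interrupted while waiting.
     */
    static <T> Optional<T> readWithRetry(Supplier<Optional<T>> read,
                                         int maxAttempts,
                                         long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<T> value = read.get();
            if (value.isPresent()) {
                return value;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return Optional.empty();
                }
                delayMs *= 2; // exponential backoff
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated lookup that only succeeds on the third call.
        Optional<String> result = readWithRetry(
            () -> ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("value"),
            5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

If reads that fail on the first attempt routinely succeed on the second or third, the problem looks more like timing (for example, lookups racing replication) than lost data. If they never succeed, the data is probably not reachable from the querying peer at all.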

Is there any way you could help me make my reads more reliable, @tbocek?

Hi @iggydv, thanks for the detailed report. My advice is to enable as much logging as possible and go through the logs to see where it got stuck. Maybe upgrading the Netty library could help.

One problem I could never really solve arises with TCP and short-lived connections. Netty does a pretty good job with client/server communication, but I ran into issues with fast-paced, short-lived TCP connections (some even shutting down before the connection was established).

Thus, we are currently looking into a UDP-based protocol to make the connections more suitable for a P2P setting. The current repo is here: https://gitlab.com/p2p-library-in-golang/code