relab/hotstuff

Use protobuf types instead of hotstuff-defined Go types with translation layers


It would be nice to avoid translations between protobuf types and hotstuff-defined types, such as those in hotstuffpb and the corresponding translation functions. Such translations slow things down, add memory allocation overhead, and are prone to errors in the translation code.
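For illustration, this is the kind of round trip meant here. The types and field names below are simplified stand-ins, not the actual hotstuffpb API:

package main

import "fmt"

// Illustrative stand-in for a protobuf-generated message (the real ones
// live in hotstuffpb); fields here are hypothetical.
type BlockProto struct {
	Parent []byte
	Cmd    []byte
	View   uint64
}

// Illustrative stand-in for the corresponding hotstuff-defined type.
type Block struct {
	parent [32]byte
	cmd    string
	view   uint64
}

// blockFromProto shows the shape of the translation layer: every message
// crossing the wire is copied field by field into a new struct, and
// conversions like []byte -> string allocate.
func blockFromProto(pb *BlockProto) *Block {
	b := &Block{
		cmd:  string(pb.Cmd), // allocates a copy of the command payload
		view: pb.View,
	}
	copy(b.parent[:], pb.Parent)
	return b
}

func main() {
	pb := &BlockProto{Parent: make([]byte, 32), Cmd: []byte("op"), View: 1}
	fmt.Println(blockFromProto(pb).view)
}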

Looking at some profiles, it doesn't seem like the conversions are slowing things down much.

Here's a memory profile:

File: hotstuff
Type: alloc_space
Time: Apr 16, 2022 at 12:03pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 652.07MB, 57.91% of 1126.06MB total
Dropped 174 nodes (cum <= 5.63MB)
Showing top 10 nodes out of 173
      flat  flat%   sum%        cum   cum%
  156.01MB 13.85% 13.85%   156.01MB 13.85%  math/big.nat.make
  107.02MB  9.50% 23.36%   212.02MB 18.83%  github.com/relab/hotstuff/crypto/ecdsa.ThresholdSignature.ToBytes
   82.50MB  7.33% 30.69%    82.50MB  7.33%  math/big.(*Int).Bytes (inline)
      78MB  6.93% 37.61%   138.51MB 12.30%  github.com/relab/hotstuff/crypto/ecdsa.Signature.ToBytes
   57.50MB  5.11% 42.72%    57.50MB  5.11%  context.WithCancel
   53.50MB  4.75% 47.47%    53.50MB  4.75%  reflect.New
   35.51MB  3.15% 50.62%    35.51MB  3.15%  google.golang.org/protobuf/proto.MarshalOptions.marshal
   34.51MB  3.06% 53.69%   212.03MB 18.83%  github.com/relab/hotstuff/crypto.(*cache).VerifyThresholdSignature
      24MB  2.13% 55.82%       24MB  2.13%  google.golang.org/protobuf/internal/impl.consumeBytesNoZero
   23.51MB  2.09% 57.91%    24.01MB  2.13%  google.golang.org/grpc.(*parser).recvMsg

memprofile.pb.gz
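The top entries point at math/big rather than the translation layer itself: ecdsa.Signature.ToBytes and ThresholdSignature.ToBytes allocate through big.Int.Bytes, which returns a fresh slice on every call. A minimal sketch of that pattern, and one possible way to cut those allocations with FillBytes into a caller-provided buffer (assuming r and s are the signature components and a fixed 32-byte width for P-256):

package main

import (
	"fmt"
	"math/big"
)

// Allocating pattern: both Bytes calls return new slices, and append may
// allocate a third time. This is what dominates the memory profile above.
func sigToBytesAlloc(r, s *big.Int) []byte {
	return append(r.Bytes(), s.Bytes()...)
}

// Alternative: FillBytes writes the big-endian value into a caller-provided
// buffer, so a reused buffer avoids the per-call allocations.
func sigToBytesFill(r, s *big.Int, buf *[64]byte) {
	r.FillBytes(buf[:32]) // zero-pads to 32 bytes for P-256
	s.FillBytes(buf[32:])
}

func main() {
	r, s := big.NewInt(12345), big.NewInt(67890)
	fmt.Printf("%x\n", sigToBytesAlloc(r, s))
	var buf [64]byte
	sigToBytesFill(r, s, &buf)
	fmt.Printf("%x\n", buf)
}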

Here's a cpu profile:

File: hotstuff
Type: cpu
Time: Apr 16, 2022 at 12:01pm (CEST)
Duration: 70.65s, Total samples = 74.83s (105.91%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 43400ms, 58.00% of 74830ms total
Dropped 770 nodes (cum <= 374.15ms)
Showing top 10 nodes out of 217
      flat  flat%   sum%        cum   cum%
   13490ms 18.03% 18.03%    15130ms 20.22%  syscall.Syscall
   13130ms 17.55% 35.57%    13130ms 17.55%  runtime.futex
    6400ms  8.55% 44.13%     6400ms  8.55%  p256MulInternal
    3090ms  4.13% 48.26%     3090ms  4.13%  p256SqrInternal
    1710ms  2.29% 50.54%     1710ms  2.29%  runtime.epollwait
    1450ms  1.94% 52.48%     1450ms  1.94%  crypto/elliptic.p256OrdSqr
    1390ms  1.86% 54.34%     6610ms  8.83%  crypto/elliptic.p256PointDoubleAsm
     990ms  1.32% 55.66%      990ms  1.32%  crypto/elliptic.p256Sqr
     890ms  1.19% 56.85%     3540ms  4.73%  runtime.mallocgc
     860ms  1.15% 58.00%      860ms  1.15%  runtime.nextFreeFast (inline)
(pprof) top -cum
Showing nodes accounting for 13.95s, 18.64% of 74.83s total
Dropped 770 nodes (cum <= 0.37s)
Showing top 10 nodes out of 217
      flat  flat%   sum%        cum   cum%
     0.04s 0.053% 0.053%     17.25s 23.05%  github.com/relab/hotstuff/crypto.(*cache).Verify
         0     0% 0.053%     15.77s 21.07%  github.com/relab/hotstuff/crypto/ecdsa.(*ecdsaCrypto).Verify
     0.02s 0.027%  0.08%     15.75s 21.05%  crypto/ecdsa.Verify
         0     0%  0.08%     15.73s 21.02%  crypto/ecdsa.verify (inline)
     0.02s 0.027%  0.11%     15.73s 21.02%  crypto/ecdsa.verifyGeneric
     0.08s  0.11%  0.21%     15.16s 20.26%  runtime.mcall
    13.49s 18.03% 18.24%     15.13s 20.22%  syscall.Syscall
     0.04s 0.053% 18.29%     15.06s 20.13%  internal/poll.ignoringEINTRIO
     0.20s  0.27% 18.56%     14.54s 19.43%  runtime.schedule
     0.06s  0.08% 18.64%     14.11s 18.86%  github.com/relab/hotstuff/crypto/ecdsa.(*ecdsaCrypto).VerifyThresholdSignature.func1

cpuprofile.pb.gz
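For reference, profiles like these can be pulled from a running replica by exposing the standard net/http/pprof endpoints; a minimal sketch (the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// With this endpoint exposed, profiles like the ones above can be
	// fetched with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=70
	//   go tool pprof http://localhost:6060/debug/pprof/allocs
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}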

While running experiments to derive the highest possible throughput from the implementation, I looked at the CPU and memory profiles of the replicas at the current maximum of 300 kops. It appears considerable time is spent in GC: around 11% after I set GOGC to 2000; before that it was close to 16%.

[Screenshot: profile showing time spent in GC]
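For reference, GOGC can also be adjusted at runtime with runtime/debug.SetGCPercent, which is equivalent to setting the environment variable:

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=2000: the heap can grow to roughly
	// 20x the live set before a collection triggers, trading memory for
	// less time spent in GC.
	old := debug.SetGCPercent(2000)
	fmt.Println("previous GOGC:", old)
}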

In the memory profile, a significant majority of the allocations come from translating from proto to hotstuff-defined structures.

[Screenshot: memory profile showing allocations from proto translations]

I guess this is the evidence we need for the performance impact this is having. I haven't studied this in depth yet, but would it be a large change to remove these translations? It would be interesting to see whether we can boost the throughput even further without them.
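As a rough sketch of what removing the translations could look like (names are hypothetical, not the current hotstuff API): handlers would accept the generated protobuf type directly instead of converting it first, so messages decoded by gRPC flow into consensus without an extra copy:

package main

import "fmt"

// Hypothetical stand-in for a hotstuffpb-generated message.
type ProposeMsg struct {
	View  uint64
	Block []byte
}

// Before: OnPropose(proposalFromProto(msg)) copies every field.
// After: the event handler consumes the wire type directly.
type EventHandler interface {
	OnPropose(msg *ProposeMsg)
}

type consensus struct{}

func (c *consensus) OnPropose(msg *ProposeMsg) {
	// fields are read straight off the decoded protobuf message
	fmt.Println("proposal for view", msg.View)
}

func main() {
	var h EventHandler = &consensus{}
	h.OnPropose(&ProposeMsg{View: 7, Block: []byte("cmd")})
}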