supranational/supra_seal

C2 not on curve

llifezou opened this issue · 19 comments

When I use SupraSeal C2 to compute the C2 proof in lotus, the result is reported as not on curve.
log:

c2 0 start proof computation
2023-08-22T06:26:08.794 DEBUG filcrypto::util::types > seal_commit_phase2: start
2023-08-22T06:26:08.988 INFO filecoin_proofs::api::seal > seal_commit_phase2:start: SectorId(0)
2023-08-22T06:26:08.988 INFO filecoin_proofs::caches > trying parameters memory cache for: STACKED[68719476736]
2023-08-22T06:26:08.988 INFO filecoin_proofs::caches > no params in memory cache for STACKED[68719476736]
2023-08-22T06:26:08.988 INFO storage_proofs_core::parameter_cache > parameter set identifier for cache: layered_drgporep::PublicParams{ graph: stacked_graph::StackedGraph{expansion_degree: 8 base_graph: drgraph::BucketGraph{size: 2147483648; degree: 6; hasher: poseidon_hasher} }, challenges: LayerChallenges { layers: 11, max_count: 18 }, tree: merkletree-poseidon_hasher-8-8-2 }
2023-08-22T06:26:08.988 INFO storage_proofs_core::parameter_cache > ensuring that all ancestor directories for: "/data2/params/v28-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-2-sha256_hasher-96f1b4a04c5c51e4759bbf224bbc2ef5a42c7100f16ec0637123f16a845ddfb2.params" exist
2023-08-22T06:26:08.988 INFO storage_proofs_core::parameter_cache > checking cache_path: "/data2/params/v28-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-2-sha256_hasher-96f1b4a04c5c51e4759bbf224bbc2ef5a42c7100f16ec0637123f16a845ddfb2.params" for parameters
2023-08-22T06:26:08.988 INFO storage_proofs_core::parameter_cache > Verify production parameters is false
2023-08-22T06:26:23.368 INFO storage_proofs_core::parameter_cache > read parameters into SuprasSeal from cache "/data2/params/v28-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-2-sha256_hasher-96f1b4a04c5c51e4759bbf224bbc2ef5a42c7100f16ec0637123f16a845ddfb2.params"
2023-08-22T06:26:23.369 TRACE filecoin_proofs::api::seal > got groth params (68719476736) while sealing
2023-08-22T06:26:23.369 TRACE filecoin_proofs::api::seal > snark_proof:start
2023-08-22T06:26:23.391 INFO bellperson::groth16::prover::supraseal > Bellperson 0.25.0 with SupraSeal is being used!
2023-08-22T06:27:51.387 INFO bellperson::groth16::prover::supraseal > synthesis time: 87.996527s
2023-08-22T06:27:51.387 INFO bellperson::groth16::prover::supraseal > starting proof timer
2023-08-22T06:28:21.495 INFO bellperson::groth16::prover::supraseal > prover time: 30.107821573s
2023-08-22T06:28:28.150 DEBUG filcrypto::util::types > seal_commit_phase2: end
2023-08-22T06:28:28.150Z        WARN    lotus-bench     lotus-bench/main.go:115 not on curve

Stack backtrace:
   0: storage_proofs_core::compound_proof::CompoundProof::circuit_proofs::{{closure}}
   1: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
   2: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
   3: storage_proofs_core::compound_proof::CompoundProof::circuit_proofs
   4: filecoin_proofs::api::seal::seal_commit_phase2
   5: filecoin_proofs_api::seal::seal_commit_phase2_inner
   6: filecoin_proofs_api::seal::seal_commit_phase2
   7: std::panicking::try
   8: filcrypto::util::types::catch_panic_response
   9: seal_commit_phase2
  10: _cgo_be609e58ba65_Cfunc_seal_commit_phase2
             at /tmp/go-build/cgo-gcc-prolog:619:11
  11: runtime.asmcgocall
             at /usr/local/go/src/runtime/asm_amd64.s:844

machine information:
AMD EPYC 7302P 16-Core Processor
Ubuntu 20.04.6 LTS
NVIDIA GeForce RTX 3080

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

nvidia-smi
NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2

gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

I tried multiple machines and multiple driver versions (CUDA Toolkit 12.2.0, CUDA Toolkit 12.0.0, CUDA Toolkit 11.8.0), and all of them reported the same error.

What is the cause, and how can I fix it? Thanks.

Could you send us the failing commit-phase1 output if possible?

c2_in.json.zip
Here is my commit-phase1 input. thanks

vmx commented

@sandsentinel to read that into Rust, this should do it:

cat c2_in.json|jq -r '.Phase1Out'|base64 --decode > precommit-phase1-output

If you need further help with getting it into Rust, let me know.
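For anyone without jq at hand, the same extraction can be sketched in Python. This assumes only what the command above implies: that c2_in.json is a JSON object whose `Phase1Out` field holds base64-encoded bytes. The function name `extract_phase1_output` is made up for illustration.

```python
import base64
import json
from pathlib import Path


def extract_phase1_output(json_path, out_path):
    """Decode the base64 'Phase1Out' field of a C2 input file into raw bytes.

    Assumption: the file is a single JSON object with a 'Phase1Out' key,
    as implied by the jq command in this thread.
    Returns the number of bytes written.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        doc = json.load(f)
    raw = base64.b64decode(doc["Phase1Out"])
    Path(out_path).write_bytes(raw)
    return len(raw)


if __name__ == "__main__":
    # Same file names as the shell one-liner above.
    size = extract_phase1_output("c2_in.json", "precommit-phase1-output")
    print(f"wrote {size} bytes")
```

Printing the byte count makes it easy to compare against the `commit_phase1_output_bytes len` line that the C2 test prints.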

When I run the c2 test with cargo test --release --test c2 -- --nocapture (following the steps in https://github.com/supranational/supra_seal/blob/main/c2/README.md exactly), a system exception occurs: kernel:[157016.634176] watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [c2-62cfd3e18ec6:206324]

AMD EPYC 7302P 16-Core Processor
Ubuntu 20.04.6 LTS
NVIDIA GeForce RTX 3080

When running lotus-bench, the same message also appears after the not on curve error: kernel: [157016.634176] watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [c2-62cfd3e18ec6:206324]

cat c2_in.json|jq -r '.Phase1Out'|base64 --decode > precommit-phase1-output

Is there a way to produce directly the output of commit phase 1? A file similar to this?

vmx commented

Is there a way to produce directly the output of commit phase 1? A file similar to this?

The output of the command should be exactly that. Does it fail parsing?

The error has a strong "hardware error" vibe, say a faulty memory module, but then it shouldn't persist across multiple systems, even if the systems are identical. How many systems did you try it on? Are they identical? Do all of them fail to execute the builtin C2 cargo test? (Let's put aside everything but the builtin C2 test for now.)

Here is another thought. Is it possible that it's a deferred error from reading SRS? I mean if the SRS file is corrupted and contains a not-on-curve point, then it will corrupt the final result. If you populate your SRS-s from a common local source, it's not impossible to imagine that this local source got corrupted because of a hardware problem on the machine that initially fetched the parameters. How do you populate your SRS-s to multiple machines? (For reference, the [blst] deserialization does perform the on-curve check, but the return values are currently ignored.)

Just in case, the soft lockup can't be triggered by a not-on-curve point in SRS. It's a problem by itself and is likely an indication of a hardware problem. If multiple machines exhibit the same problem, it's more likely to be a common hardware configuration problem that the kernel can't handle correctly.
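To see how often the soft lockup fires and on which CPUs, the kernel log can be scanned for the watchdog message. The sketch below is only an illustration: the regular expression is derived from the single log line quoted in this thread, so other kernels may word the message differently, and `find_soft_lockups` is a made-up helper name.

```python
import re

# Pattern modelled on the one line quoted in this thread, e.g.
# "watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [c2-62cfd3e18ec6:206324]"
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]:]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return (cpu, seconds, process name, pid) for every soft-lockup line."""
    hits = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append(
                (int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
            )
    return hits


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    for cpu, secs, comm, pid in find_soft_lockups(sys.stdin):
        print(f"CPU {cpu} stuck for {secs}s in {comm} (pid {pid})")
```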

Does it fail parsing?

It fails like below. Here's a link to the failing line. The file provided was for a 64 GiB sector, right? I changed any 32 GiB structures/references to 64 GiB prior to running.

running 1 test
*** Restoring commit phase1 output file
commit_phase1_output_bytes len 39165880
thread 'run_seal' panicked at tests/c2.rs:38:54:
called `Result::unwrap()` on an `Err` value: Custom("invalid value: integer `975332975`, expected variant index 0 <= i < 3")

(Let's put aside everything but the builtin C2 test for now.)
The builtin C2 cargo test fails with the same error:

running 1 test
*** Restoring commit phase1 output file
commit_phase1_output_bytes len 11014808
Reading SRS file "/var/tmp/filecoin-proof-parameters/v28-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.params"

Starting seal_commit_phase2
thread 'run_seal' panicked at 'called `Result::unwrap()` on an `Err` value: not on curve', tests/c2.rs:71:87
stack backtrace:
   0: rust_begin_unwind
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/result.rs:1651:5
   3: core::ops::function::FnOnce::call_once
   4: core::ops::function::FnOnce::call_once
             at /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test run_seal ... FAILED

failures:

failures:
    run_seal

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 312.89s

The proof parameter file should be correct (md5):
5c65538166f1e3d999db7afa91a60862 v28-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.params
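When parameter files are copied from one local source to several machines, comparing digests on every machine is a cheap way to rule out the corrupted-SRS theory raised earlier. A minimal sketch follows; `file_md5` is an illustrative helper, and the digest to compare against is whatever a trusted copy yields (the value above is simply what the reporter measured, not an officially confirmed one).

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks.

    Chunked reads keep memory use flat even for multi-gigabyte
    parameter files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py <params-file> <expected-md5>
    actual = file_md5(sys.argv[1])
    expected = sys.argv[2].lower()
    print("match" if actual == expected else f"MISMATCH: got {actual}")
```

A matching digest only shows that the file on disk equals the reference copy; it says nothing about errors introduced later while the file is read into memory.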

Next I'll try to look for hardware configuration issues, and I'll update here if there's any progress. Do you have a suggestion? @dot-asm

Same input, works fine with the following hardware configuration:
7542 cpu + ubuntu 2204 + 2080ti(cuda 12.2)

@vmx @dot-asm @sandsentinel

Again, how many systems did you try the builtin C2 cargo test on, and how many of them fail to execute it?

As for the successful test, what happens if you copy the test binary from the failing system and execute it directly from the command line on the working system? I mean, when you run cargo test --release it says Running tests/c2.rs (target/release/deps/c2-<a-hash>), and the suggestion is to copy the c2-<a-hash> and just run it. It's likely to complain about a missing commit_phase1_output_path, in which case create the required directory and copy the commit-phase1-output file as well. If it fails, try a newer C++ compiler on the failing u20 system, e.g. as env CXX=g++-10 cargo test --release [after apt-get install g++-10 if necessary].

What happens if you copy the test binary from the failing system and execute it directly from command line on the working system.

I have tested on multiple Ubuntu 20.04 machines, all with the same result. So I upgraded one of the machines to 22.04, and the result changed: both cargo test and lotus-bench now work normally.
u20

./c2-62cfd3e18ec64139

running 1 test
test run_seal has been running for over 60 seconds
test run_seal ... FAILED

failures:

---- run_seal stdout ----
*** Restoring commit phase1 output file
commit_phase1_output_bytes len 11014808
Reading SRS file "/var/tmp/filecoin-proof-parameters/v28-stacked-proof-of-replication-merkletree-poseidon_hasher-8-8-0-sha256_hasher-82a357d2f2ca81dc61bb45f4a762807aedee1b0a53fd6c4e77b46a01bfef7820.params"
Starting seal_commit_phase2
thread 'run_seal' panicked at 'called `Result::unwrap()` on an `Err` value: not on curve', tests/c2.rs:71:87
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

u22

 ./c2-62cfd3e18ec64139

running 1 test
test run_seal has been running for over 60 seconds
test run_seal ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 149.57s

For the configuration of my machine, u22 works fine.
So I think you're right that it's more likely to be a common hardware configuration problem that the kernel can't handle correctly.

@dot-asm

Cool! So it's not faulty hardware, nor compiler bug (*), but a kernel issue. Could you provide outputs from uname -r on a failing system (and a working one)? This is for future reference in case somebody else runs into something similar.

Now, you mentioned that lotus-bench worked. Was it using the 64GiB sector size? In essence we ourselves haven't explicitly tested 64GiB (yet), but there is nothing sector-size specific in the C2 CUDA core, hence there is no reason for us to believe that it wouldn't work.

(*) I've tried gcc-9 on u22 upon initial attempt to reproduce the problem, but u22 offers gcc-9.5.0, while u20 has gcc-9.4.0. So the outcome was inconclusive.

In essence we ourselves haven't explicitly tested 64GiB (yet)

I tested a 64GiB c2 a few hours ago and it passed. We should be safe there.

Oh! Cool! Thanks! (Next time do tell, so [I] don't bring up stale information:-)

This is the result on my machine. (Maybe different hardware configurations will have different results)

good

uname -r
5.15.0-82-generic

gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

bad

uname -r
5.4.0-144-generic

gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

The issue appears to be resolved, so I am closing it. Thank you for everything; it's really great.
@vmx @dot-asm @sandsentinel

bad: 5.4.0-144-generic

Oh! It's not even the latest update, 144 is 9 kernel updates behind. Either way, it's possible that the problem could have also been resolved by installing linux-generic-hwe-20.04, which gives you 5.15.0-82 on u20. As opposed to upgrading to u22, that is.
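For future reference, here is a small sketch that compares the running kernel against the two data points from this thread. It assumes Ubuntu-style release strings such as `5.4.0-144-generic`; the only known facts are that 5.4.0-144 failed and 5.15.0-82 worked on the reporter's hardware, so anything in between is reported as untested rather than guessed. The helper names are illustrative.

```python
import platform
import re


def parse_kernel_release(release):
    """Turn '5.4.0-144-generic' into the tuple (5, 4, 0, 144)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


# The only two data points reported in this thread.
KNOWN_BAD = parse_kernel_release("5.4.0-144-generic")
KNOWN_GOOD = parse_kernel_release("5.15.0-82-generic")


def classify(release):
    """Place a kernel release relative to the two reported versions."""
    version = parse_kernel_release(release)
    if version >= KNOWN_GOOD:
        return "not-older-than-working"
    if version <= KNOWN_BAD:
        return "not-newer-than-failing"
    return "untested"


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    print(release, "->", classify(release))
```

An "untested" result covers the later 5.4.0 updates mentioned above, which nobody in this thread tried.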