
Validator role


Since there will be plenty of questions/discussion, I am frontrunning these by opening an issue here to have all the comms in one place.

https://code4rena.com/blog/code4rena-spring-update-2024

Do I need to stake 32 ETH to be a validator?

Just some initial thoughts:

For Validator-improved submissions: if the judge believes the validator added a measurable enhancement, they get a split of the value of the issue:

  • 25% cut → small enhancement = moved submission from unsatisfactory to satisfactory
  • 50% cut → med enhancement = moved submission from invalid to valid
  • 75% cut → large enhancement = identified a more severe vulnerability
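As a quick illustration of the split arithmetic these tiers imply (purely hypothetical numbers and function names, not C4's awarding code):

```python
# Hypothetical sketch of the enhancement tiers quoted above: the validator
# takes the stated cut of the issue's award and the original submitter keeps
# the remainder. Illustrative only -- not C4's actual awarding logic.
VALIDATOR_CUTS = {"small": 0.25, "medium": 0.50, "large": 0.75}

def split_award(issue_award: float, enhancement: str) -> tuple[float, float]:
    """Return (validator_share, original_warden_share) for one issue."""
    cut = VALIDATOR_CUTS[enhancement]
    return issue_award * cut, issue_award * (1 - cut)

print(split_award(1000.0, "medium"))  # (500.0, 500.0)
```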

What's the motivation behind having validators improve submissions by wardens? I find it strange that validators can modify findings such that it affects the final payout distribution amongst wardens.

Let's say a validator builds on a report to change it from invalid to valid. Wouldn't this then be unfair to other wardens who identified valid findings by themselves? Additionally, the warden who submitted the original report is now paid for submitting an invalid report, which I find strange.

Also, what's the difference between valid/invalid and unsatisfactory/satisfactory?

Judges can play the Judge and Validator role on the same audit.

Isn't this a conflict of interest? The judge can improve a submission as a validator and then give himself a 75% cut.

I am reposting a revision of feedback that was previously shared privately in the #judges channel on Discord:

General Feedback

There's definitely a lot to digest with the latest Code4rena Spring Update, specifically around the new Validator role and how it functions within the overall contest flow.

As preliminary feedback, I believe that the changes introduced in the Spring Update are substantial and therefore merit a gradual roll-out via "experimental" contests before the latest changes are applied to all upcoming contests.

Validator % Capture

Problem Statement

Multiple safeguards must be set in place for this feature to function as Code4rena originally envisioned. Redirecting a warden's proceeds based on non-quantitative measurements (e.g. an increase in quality, the introduction of a PoC that may not be needed, etc.) is bound to cause fierce debates, especially at the current level of granularity proposed (25%, 50%, 75%).

Proposed Solution

In my opinion, the most transparent way to enforce the new rules around Validators (albeit slightly technically challenging on C4's side) would be for the judge to be unaware of which submission was original and which was improved.

This would eliminate prejudice which, however transparent a judge may be, is bound to happen depending on how data is presented to them. Additionally, it would allow a "fairer" evaluation of what cut the Validator should receive, given that C4's toolkit can simply detect whether no change was made (i.e. the two are marked as duplicates), and so on. The above adjustment would also permit underperforming Validators to be detected and ranked lower in the selection pool.

Validator Post-Contest Finding

Problem Statement

Another, albeit smaller, problem with the above is that a validator gets an effective extension of the contest's deadline: they can "submit issues" beyond the deadline by editing a very low-effort, ambiguous finding and introducing details that move it from "invalid" to "valid".

Proposed Solution

No dedicated mechanism needs to exist for this; guideline changes for Judges may simply need to be introduced on the C4 website. Specifically, judges who review "improved" findings must assess whether one can reasonably be deduced from the other and, if not, invalidate both the improved and non-improved variants. To note, this approach is compatible with the previous chapter's recommendation by being introduced as a step beyond the first "judge passing" validation.

Validator Misaligned Incentive

Problem Statement

I understand that there is a level of trust associated with all roles in C4; however, when evaluating whether a feature in the contest's process is appropriate, we must consider the scenario of a malicious Validator, however unlikely. In the recently rolled-out feature of presorters invalidating QA and Gas Reports, low-quality submissions were deliberately excluded from its scope, to prevent any finding being missed by the Judge of a contest.

The newly proposed feature permits a Validator, on an entirely different repository not monitored by the Judge, to invalidate findings before they ever reach the team, the judge, or the presorter. In and of itself, this permits a Validator who has participated in the contest in another capacity to eliminate potentially valid findings.

Proposed Solution

Each validator of a particular contest acts independently of others. Afterward, the presort is responsible for picking the "best" revision of a particular finding that the validators created. This will significantly reduce the likelihood of a single malicious Validator having any effect on the final findings that the Judge ultimately reviews.

Warden Disputes

Problem Statement

The present mechanism has not made it clear how Wardens are able to dispute findings that arise from Validator "enhancements". Additionally, the scenario of a warden participating in a contest without backstage access, acquiring one afterward, and raising concerns within the PJQA window has not been addressed. As a final consideration, a finding that was adjusted by a Validator may eventually be debated by both the validator that enhanced it as well as the finding's original author which, in my opinion, will be a very tricky situation to resolve.

Proposed Solution

PJQA concerns raised by Wardens should also accommodate validator upgrades. Per the earlier concerns in this post, wardens should be able to raise a PJQA concern requesting, for example, that a particular finding upgraded by a validator from invalid to valid be re-assessed for reasonable deduction.

Concluding Thoughts

I would like to reiterate that I do not believe the feature is presently optimal for production at this stage, and in my humble opinion requires further iteration by multiple judges before being ubiquitously applied to all upcoming contests.

I concur with @alex-ppg and @MiloTruck: this feature is not ready, and there are definite conflicts of interest at its foundation that require more refinement. I'm also curious what the main objective of introducing this role is and what problem we are trying to solve. The fact that a Validator (who could even be a judge) can improve any finding is not fair to all participants: this actor has a view of all submitted reports, which nobody else had access to during the contest, and from that view ideas to improve certain reports can emerge, negatively impacting most of the wardens' pot in favor of the Validator and the improved warden's report. It's like insider trading: the Validator has insider knowledge based on all the submissions, which allows them to boost a report in their favor once the contest is concluded.

Personally, I think the idea of letting validators improve warden submissions is really, really good. It will lead to higher quality submissions overall and reduce the burden on sponsors and judges of validating incomplete submissions.

That being said, sponsor input is a key part of the judging lifecycle, and the new process places that burden squarely on the shoulders of validators, who have no sponsor input at that point. Validators are now expected to make a binary decision on all findings and either mark them as "valid" and be penalized if they aren't, or "invalid", in which case it is unclear what the penalty is if an issue is successfully escalated in PJQA.

If we expect what used to be "sufficient quality submission" findings to now be "valid", then the penalty for missing a valid finding must be higher than that for false positives, as otherwise the incentives become heavily skewed towards rejecting ambiguous findings. If I believe a finding has a 30% chance of being valid, then I am incentivized to mark it as "invalid", whereas it used to be the case that they would be forwarded to the sponsor + judge to make the decision. Not penalizing false negatives more harshly or at all will lead to more PJQA appeals and potentially missed valid findings.

I also honestly don't see how the proposed mechanisms of validating in batches of 5 findings and allowing for enhancements are compatible with one another, as the validator will be blind to the complete set of duplicates of a given finding and may choose to enhance it, while a more complete duplicate that negates the need for an enhancement may exist later in the queue. It is also unclear how each dupe set is meant to be assigned to a single validator given the batching mechanism.

Moreover, while others and I have found the deduping tool highly useful for presorting, it is not a silver bullet, and I believe that if we had metrics on its accuracy vs. final judging results, it would become clear that it fails to correctly label duplicates for a large number of the final unique findings. This change completely strips away the lookout's ability (and responsibility) to identify duplicates across the full set of findings, and I believe this will negatively impact the validator's work as well as significantly increase the workload of the judge (and likely the sponsor), who now have to do the real deduping themselves.

So to summarize:

  • False negatives must also be penalized and carry a higher penalty than false positives
  • Deduping will not work. I have no solution for this
  • Enhancements seem hard to implement without proper deduping. Ideally deduping would happen first, and then only the best duplicate would be eligible for enhancement.

Judges can play the Judge and Validator role on the same audit.

My recommendation would be to have the judge forego all validator rewards, as is the case when the judge participates in a contest he's subsequently judging.

So to summarize:

  • Enhancements seem hard to implement without proper deduping. Ideally deduping would happen first, and then only the best duplicate would be eligible for enhancement.

I do not believe that the above bullet point is correct. Presently, a finding does not get a binary judgment of whether it is valid or not; it may be primary, it may be penalized partially, or lack sufficient proof even if it was on the right track. An enhancement of duplicates makes sense, as findings that could have been penalized may ultimately be accepted for a full reward.

The Validator has insider knowledge based on all the submissions

I believe this is a very fair point, and important to consider. There are multiple instances of multiple attack vectors being combined to raise the severity of a single submission. In such instances, all single-vector duplicates are partially awarded (due to being of a lower severity) and are grouped under the one that combines all of them as detailed in the C4 docs.

A Validator will indeed have full access to all submissions after the contest ends and will be able to combine vulnerabilities that would otherwise not have been combined during the normal course of the contest. As a Validator rule, I think it might be wise to prevent the combination of existing findings reported in the contest, to avoid the specified scenario.

I understand the value added by a combination of vulnerabilities in the ultimate Sponsor output, however, accepting such combinations would effectively be unfair and indeed use privileged information.

Is it wise to give validators access to the usual findings repo when they can use this as inspiration to enhance issues? I feel this would be very detrimental to top wardens, which is contrary to the stated goal of the Spring update.

Beginner/Intermediate Finding:

  • "There is a problem with the access control of function X. It is dangerous"
  • Should be invalid since no exploitation path is shown.

Top Warden Finding:

  • "There is a problem with the access control of function X. It is dangerous since there a specific edge case due to the integration on L2 Y when input values are of this specific configuration."
  • Valid Medium Unique

Validator:

  • "I don't fully understand what he is saying, but he's a top warden so: copy/paste/rewrite'
  • Beginner finding => Valid High & validator gets 75%
  • Top Warden finding loses unique status and the warden loses a lot of rewards

Would the above example not become a common occurrence? The ability to take a potential or low vulnerability and transform it into a valid unique medium/high due to exceptionally specific knowledge is often what separates the top from the middle in my opinion.

Perhaps it would be best to lock the usual findings repo until the validator work is finished?

Presently, a finding does not get a binary judgment of whether it is valid or not; it may be primary, it may be penalized partially, or lack sufficient proof even if it was on the right track. An enhancement of duplicates makes sense, as findings that could have been penalized may ultimately be accepted for a full reward.

I see what you mean, but I feel like this is not (or shouldn't be) the purpose of the enhancements.

You can improve every single duplicate until it reaches the level of the main finding, but what's the point of that? In that scenario the validator is essentially taking rewards away from all the other duplicates, which would otherwise have benefited from the partial-credit penalty applied to those findings. And the validator is not adding any value, since the finding has already been properly demonstrated/articulated elsewhere.

This would just incentivize duplication of efforts and render the concept of partial credit duplicates moot.

I do not believe validators improving warden submissions has material benefit. If we examine the parties involved:

Sponsors:

The protocol receiving the security review does not benefit from this process unless the invalid/unsatisfactory report is either chosen for the actual report or is a unique finding. Adding another warden's name to the header above the listed vulnerability is inconsequential to the project's security and overall report.

Wardens:

Those with satisfactory/valid reports are significantly and materially negatively impacted by this process. The rewards effectively get reorganized due to increased dupes, but it would be rare for a warden to actually receive more reward because of this process. Those with unsatisfactory/invalid reports benefit moderately from this process, as they will earn rewards they would otherwise not be entitled to, but those rewards are split between them and the validator.

Judges:

There is a moderate impact on judging due specifically to the enhancement part of the change, in that the judges have to determine the validity of the changes etc. However, the validation process should significantly reduce judge workload overall, so the added work from the enhancement judging is somewhat a moot point.

Validators:

As this is a newly created role and there are no validators yet, there is no delta to consider in terms of benefit/detriment. The validator role would benefit from the added authority and financial gain, but removing the enhancement mechanism does not harm anybody; it may, however, reduce interest in the validator role due to lower reward potential.

Alternative

While personally I do not think the enhancement is of benefit, should it be included I propose an alternate idea:

An alternative solution that includes validator enhancement: the validators have their own pool, which they compete for with each other based on the number of unsatisfactory -> satisfactory / invalid -> valid reports and their severity. Unsatisfactory/invalid reports added to the prize pool are eligible for reduced prize shares.

Some of my final thoughts on Validators in this current version. Seems this discussion is dead, even though Validators will be live for all audits in 2 days...

In the new system, if you are not a low-level warden or a validator, you will receive LESS in rewards

  • Validators will mean that there are more duplicates, so most Wardens will receive less in rewards for their findings. The only people who benefit are the Validators themselves, thanks to the Validator pool rewards and the % cut of upgraded findings' rewards, plus the handful of lower-level wardens who have low report accuracy and get one of their reports upgraded and accepted. I.e. high-accuracy Wardens get their rewards diluted in favour of Validators and low-accuracy Wardens.

I believe projects do not benefit much from the current Validator model:

  • Projects do not benefit much, as most upgraded reports will just be duplicate issues. Projects would only benefit if a unique finding that would otherwise be rejected became accepted thanks to a Validator. However, this is an unlikely scenario, as Validators can only upgrade reports from the low-accuracy pool, and most uniques are found by experienced and top-level Wardens.

This will also lead to more total reports:

  • Low-level Wardens will become incentivised to submit lower-quality reports, as these can be updated by Validators, who are likely more skilled than the lower-level wardens whose findings are sent to the Validation pool. This means low-level wardens can submit any finding they are not fully confident in due to a lack of knowledge and hope a Validator will update the report to an acceptable level.

An example of reward dilution by Validators:

Warden A submits a report for a bug with 10 duplicates total, so Warden A receives 10% of the rewards for that bug.

Validator A submits the same bug (as they can compete in and validate the same contest), but they also upgrade a low-level report from Warden B so that it becomes accepted. The bug now has 11 duplicates, and the Validator also receives an extra 50% of the new finding:

Warden A reward: 1/11 = 9%
Warden B reward: 1/11 * 50% = 4.5%
Validator A reward: 1/11 + (1/11 * 50%) = 9% + 4.5% = 13.5%

Validator A has diluted the rewards of Warden A (and of all other wardens who submitted this same bug report) and increased their own reward from 10% to 13.5% for this one bug. The more duplicates Validator A upgrades, the higher their own reward and the lower the rewards of all other Wardens who have high report accuracy scores.

My initial calculations are incorrect due to reward dilution. Actually Validators are incentivised to create more duplicates for issues they have not reported to dilute the total rewards for those issues, which will increase rewards for issues they have reported.

Warden A reward: 1/11 = 9%
Warden B reward: 1/11 * 50% = 4.5%
Validator A reward: 1/11 + (1/11 * 50%) = 9% + 4.5% = 13.5%

This is not how reward shares are calculated. While the percentages each party receives for the vulnerability are technically correct if we ignore the selected-for-report multiplier, the total pool of funds for that vulnerability decreases with duplicates, so the wardens receive less than what you describe here.

Shares are calculated on an exponential distribution.

Med Risk Slice: 3 * (0.9 ^ (split - 1)) / split
High Risk Slice: 10 * (0.9 ^ (split - 1)) / split

The reward shares are calculated on an exponential curve to discourage sybil attacks, which arguably makes the calculations even worse for wardens. The actual impact on how much of a change this causes is dependent on too many factors to calculate simply.
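To make the dilution concrete, here is a rough sketch using the slice formula above (illustrative only; the real awarding also applies the selected-for-report multiplier and normalizes shares across the whole pool):

```python
# Sketch of the C4 pie-slice formula quoted above, comparing a dupe set of 10
# against the same set with an 11th (validator-enhanced) duplicate added.
def med_risk_slice(split: int) -> float:
    """Per-duplicate shares for a Medium-risk finding with `split` duplicates."""
    return 3 * (0.9 ** (split - 1)) / split

for split in (10, 11):
    per_dupe = med_risk_slice(split)   # shares per duplicate, before any validator cut
    total = per_dupe * split           # total shares allocated to the whole dupe set
    print(f"split={split}: per duplicate={per_dupe:.4f}, dupe-set total={total:.3f}")

# split=10: per duplicate ≈ 0.1162, dupe-set total ≈ 1.162
# split=11: per duplicate ≈ 0.0951, dupe-set total ≈ 1.046
# Adding the enhanced 11th duplicate shrinks both the per-duplicate share and
# the total pot for the dupe set -- the dilution effect discussed above.
```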

I strongly believe validators shouldn't take a percentage of the main H/M pot. This will create a lot of issues (already cited in the discussions above). However, I also see the value of validators in improving the experience for sponsors (the client is king). I think Validators should compete for their own reward pots and have their own point system reflecting how much they helped improve the readability and quality of the reports, and they should get rewarded according to how much effort they put into validating. An example of a point system for validators (one that could also aim to improve the speed of judging): for each issue you mark as valid and send to the main repo, you get 1 point; if the judge later deems that issue unsatisfactory or low quality, you lose 2 points. Validators would be incentivised to do their job quickly (to farm the points earlier) and also to validate and improve the quality of the findings submitted. Regarding rejecting invalid or LLM-generated reports, I don't have a recommendation yet, but I think it is food for thought to consider. Taking shares of the rewards / reducing other wardens' rewards after the submissions are finished, and with access to the high-quality submissions, will in my opinion only create drama and escalation nightmares that wouldn't benefit the C4 brand long-term.

Warden A reward: 1/11 = 9%
Warden B reward: 1/11 * 50% = 4.5%
Validator A reward: 1/11 + (1/11 * 50%) = 9% + 4.5% = 13.5%

This is not how reward shares are calculated. While the percentages each party receives for the vulnerability are technically correct if we ignore the selected-for-report multiplier, the total pool of funds for that vulnerability decreases with duplicates, so the wardens receive less than what you describe here.

Shares are calculated on an exponential distribution.

Med Risk Slice: 3 * (0.9 ^ (split - 1)) / split
High Risk Slice: 10 * (0.9 ^ (split - 1)) / split

The reward shares are calculated on an exponential curve to discourage sybil attacks, which arguably makes the calculations even worse for wardens. The actual impact on how much of a change this causes is dependent on too many factors to calculate simply.

Understood, thanks for the clarification. I saw someone already corrected me in Discord, and they pointed out that actually the opposite of what I said is true: Validators are incentivised to create more duplicates for issues they have not reported themselves, as this decreases the rewards for those issues whilst increasing the rewards for the issues they have reported (and are not creating more duplicates for). I.e. Validators are incentivised to reduce rewards for all issues they have not reported, reducing rewards for all other wardens apart from themselves.

Thank you all for the very constructive critique of the Validator role. Based on the input here, I want to offer some answers to questions, and adjustments to the mechanisms as originally outlined.

Questions and answers

Wardens appealing improved submissions

The present mechanism has not made it clear how Wardens are able to dispute findings that arise from Validator "enhancements".

Backstage wardens can certainly appeal the improvements during PJQA. [Edited to clarify that this is a backstage / PJQA process.]

Will QA issues be accounted for 50% accuracy?

Yes - if a Validator forwards a QA report to the findings repo as satisfactory, and the judge deems it unsatisfactory, it will count against their accuracy rating.

Changes to original guidelines

Validators may not improve duplicates of submissions already present in the findings repo

Several commenters noted that:

  • Validator-improved submissions could dilute HM payouts by generating more dupes
  • Validators could unfairly improve issues using proof available in the findings repo
  • Validators may be incentivized to improve dupes of issues they did not submit, to shrink the pie — and ignore dupes of their own issues

Given these critiques, we are updating the Validator guidelines as follows: if the finding is already present in the findings repo, then improved submissions will be judged in their original versions, and improvements disregarded.

Validators can only improve HM submissions, not QA/Gas reports

To clarify the Validator guidelines: improvements are only allowed for High and Medium-risk submissions (HMs), not QA or Gas reports.

Updated mechanism for selecting the top 3 QA reports

A few concerns were raised about the "top 3" approach to QA awards:

  • Risk that post-judging QA arguments could take on a fierce "PVP" quality as wardens argue for a spot in the top 3
  • Risk of variability in judging, due to the complexity of judging QA reports in general, and of choosing a top 3 in particular
  • High pressure on the individual judge

To address these concerns, we're proposing a slightly modified approach:

  • The "top 3" approach stands for the purposes of awarding.
  • During judging the judge will assess QA submissions as satisfactory or unsatisfactory.
  • Immediately following the post-judging QA phase, the RSVP'd Validators and the judge will participate in a blind vote administered by C4 staff.
  • Each judge and validator has 3 points, which they can assign to 1-3 satisfactory reports based on what they think is best. (They may choose to allocate multiple votes to one report if they prefer.)
  • The top 3 results (1st, 2nd, and 3rd place) are based on who has the most points from that set of votes.
  • The vote will be considered final and cannot be appealed.
  • The same approach will be used for Gas reports, when present.
  • QA and Gas reports closed by Validators are excluded from post-judging QA.
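For illustration, a minimal tally of the blind vote described above (report names and point allocations are hypothetical):

```python
# Illustrative tally only -- not C4 tooling. Each judge/validator distributes
# exactly 3 points across 1-3 satisfactory reports; the top 3 totals take
# 1st/2nd/3rd place.
from collections import Counter

votes = {
    "judge":       {"report-A": 2, "report-C": 1},
    "validator-1": {"report-A": 1, "report-B": 1, "report-C": 1},
    "validator-2": {"report-B": 3},
}

tally: Counter = Counter()
for ballot in votes.values():
    assert sum(ballot.values()) == 3  # each voter has exactly 3 points
    tally.update(ballot)

print(tally.most_common(3))  # [('report-B', 4), ('report-A', 3), ('report-C', 2)]
```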

Penalties for false negatives

If we expect what used to be "sufficient quality submission" findings to now be "valid", then the penalty for missing a valid finding must be higher than that for false positives, as otherwise the incentives become heavily skewed towards rejecting ambiguous findings.

If I believe a finding has a 30% chance of being valid, then I am incentivized to mark it as "invalid", whereas it used to be the case that they would be forwarded to the sponsor + judge to make the decision. Not penalizing false negatives more harshly or at all will lead to more PJQA appeals and potentially missed valid findings.

Updated consequences for Validators:

  • false positives count 50% toward your submission accuracy
  • false negatives count 100% toward your submission accuracy

In other words, each false negative carries double the negative weight of a false positive in the "incorrect" count.
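As a rough sketch of how this weighting could be applied (my own interpretation; the function and numbers are illustrative, not C4's actual accuracy calculation):

```python
# Hypothetical accuracy calculation: false positives count 50% and false
# negatives 100% toward the "incorrect" side of a Validator's rating.
def validator_accuracy(correct: int, false_positives: int, false_negatives: int) -> float:
    incorrect = 0.5 * false_positives + 1.0 * false_negatives
    total = correct + incorrect
    return correct / total if total else 1.0

# e.g. 40 correct calls, 6 false positives, 2 false negatives
print(f"{validator_accuracy(40, 6, 2):.1%}")  # 88.9%
```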

Clarifications

(Some of these questions were asked in Discord.)

Judges working as Validators

“Judges can play the Judge and Validator role on the same audit.”
Isn't this a conflict of interest? The judge can improve a submission as a validator and then give himself a 75% cut.

Judges may participate as Validators, but are not eligible for any HM pool payouts on audits they judge — even if they enhance an issue.

QA and Gas validation

  • Each Validator is assigned total_reports / total_validators to review, in addition to 5 initial HM issues.
  • Each validator should ~~nominate the top 3 of their own reviewed QA reports and forward them~~ forward all satisfactory QA reports to the findings repo for judging. [Edited 2024-05-01 to correct an error]

Rules for % cuts & enhancements - root cause

I have some doubts about this part:

25% cut → small enhancement = moved submission from unsatisfactory to satisfactory

50% cut → med enhancement = moved submission from invalid to valid

75% cut → large enhancement = identified a more severe vulnerability

If an improvement falls into the 75% case with a completely different root cause, should the original submission still be awarded 25%? Because that effectively amounts to a different submission.
Or, in other words, are Validators allowed to inject a new finding?

Improved submissions must share the same root cause as the original submission.

Assignment process

3-5 validators review the submissions -- what if two validators try to enhance the same finding? Or more generally, do validators sync their work with each other?

There's an assignment process within the validation repo that will restrict these permissions to the Validator assigned to the submission. Therefore only one Validator would be allowed to enhance a finding.

An issue we saw with the bot races was that some bots submitted gas findings that sounded plausible, but when you tested them, or understood the mechanisms behind them, they were invalid. Without PJQA I fear the same will happen here with the three-vote system, for both gas and QA reports.

@IllIllI000 wrote:

Without PJQA I fear the same will happen here with the three-vote system, for both gas and qa reports.

QA reports in the findings repo will be eligible for PJQA reassessment for satisfactory/unsatisfactory; only the closed reports in the validation repo will be excluded.

The outcome we want to avoid here is intense PVP behaviour among wardens vying for 1st/2nd/3rd place, and the proposed voting system is designed to avert that.

My hope is also that, since judges are now responsible for sorting QA reports into satisfactory/unsatisfactory, the initial assessment will be more thorough.

Changes to original guidelines

Validators may not improve duplicates of submissions already present in the findings repo

Several commenters noted that:

  • Validator-improved submissions could dilute HM payouts by generating more dupes
  • Validators could unfairly improve issues using proof available in the findings repo
  • Validators may be incentivized to improve dupes of issues they did not submit, to shrink the pie — and ignore dupes of their own issues

Given these critiques, we are updating the Validator guidelines as follows: if the finding is already present in the findings repo, then improved submissions will be judged in their original versions, and improvements disregarded.

This suggests that a validator can only improve an invalid/unsatisfactory finding if the finding is unique. I highly doubt that this will ever be the case. At this point it seems like we've lost the whole point of why the process of improving submissions was even proposed.

  • What exactly is the problem that we are trying to address that validators improving invalid/unsatisfactory submissions solves?

  • Does the process with this new guideline even still solve this problem?

  • Is it even plausible that an invalid/unsatisfactory submission will be for a unique vulnerability that a validator can "improve" to become valid/satisfactory, and can you point to specific examples of contests/findings that this would have been useful for?

QA reports in the findings repo will be eligible for PJQA reassessment for satisfactory/unsatisfactory; only the closed reports in the validation repo will be excluded.

But what if the scores of the winning reports are based on judges'/validators' false assumptions that some gas findings are valid when they're not? If that comes up during PJQA, would it lead to a reassessment of those reports' scores?

I think judges/validators should point out any invalid findings they're aware of while assigning the score, and if a significant number of findings are later found to be invalid (or present in the 4naly3er report) during PJQA, then the judge and validators should reevaluate their scores.

Edit: NVM, I just realized that the vote is after PJQA, so wardens can point out invalid findings during it.

I believe that the rule iterations are a welcome change, however, I will maintain that this type of feature should have been in a peer review process for at least a few weeks before being shipped to production. We have observed that rule changes have come into effect as a result of judge feedback in just a few days, and this is a strong indicator that the feature should not have been shipped to production.

I would like to provide some follow-up feedback on the revised ruleset:

Wardens appealing improved submissions

Backstage wardens can certainly appeal the improvements during PJQA. [Edited to clarify that this is a backstage / PJQA process.]

Wardens already protest partial rewards based on quality, which is inherently a subjective matter (e.g. some Judges mandate PoCs, others do not). Opening a new escalation avenue will simply deter judges from accepting enhancements, to avoid the associated hurdle and conflict that would arise from a Warden protesting the enhancement.

Validators may not improve duplicates of submissions already present in the findings repo

Several commenters noted that:

  • Validator-improved submissions could dilute HM payouts by generating more dupes
  • Validators could unfairly improve issues using proof available in the findings repo
  • Validators may be incentivized to improve dupes of issues they did not submit, to shrink the pie — and ignore dupes of their own issues

Given these critiques, we are updating the Validator guidelines as follows: if the finding is already present in the findings repo, then improved submissions will be judged in their original versions, and improvements disregarded.

Rules for % cuts & enhancements - root cause

Improved submissions must share the same root cause as the original submission.

While I appreciate the effort in identifying a way forward to minimize the "edge" Validators have, I do not believe this to be the right approach. It is impossible to know what duplicates will occur when a project is handed off to the judge (I can personally cite instances with over 20 re-duplications). As such, Validator effort will end up being misallocated.

Another problem here is the C4 guideline around privilege escalation. Per C4 guidelines, it is fair game to combine two vulnerabilities to demonstrate a higher severity. What will happen if vulnerability A has no duplicates, vulnerability B has duplicates, and A has been enhanced to combine B for escalation? Similarly, what will happen if A has duplicates, B has duplicates, and only one report is enhanced to escalate both A and B? In such an instance, a Validator will have provided tangible value to the Sponsor so it needs to be accounted for in one way or another.

Will QA issues be accounted for 50% accuracy?

Yes - if a Validator forwards a QA report to the findings repo as satisfactory, and the judge deems it unsatisfactory, it will count against their accuracy rating.

While the rules around HM vulnerabilities are somewhat uniform, the rules for grading QA / Gas Reports are entirely up to the discretion of each judge. As such, having such subjective evaluations impact the accuracy of a Validator is unfair unless clear QA / Gas Report guidelines are set in place.

I understand that the new rules around NC issues and gas optimizations with the optimizer turned on are a step in the right direction; however, there is still no distinction between significant and minuscule gas savings, which are awarded points differently by each judge.

Additionally, an HM vulnerability being accepted can fluctuate between H and M whereas a QA report would have a very strict impact (L) which would increase the impact of judge subjectivity when assessing these reports in terms of Validator accuracy.

Judges working as Validators

Judges may participate as Validators, but are not eligible for any HM pool payouts on audits they judge — even if they enhance an issue.

A judge will have an inherent (either subconscious or conscious) bias as to a finding's validity if they have enhanced it, introducing a human factor that will increase error rates in the judging process. As there is no upside for the judge in any case, I propose preventing judges from acting as validators altogether.

QA and Gas validation

  • Each Validator is assigned total_reports / total_validators to review, in addition to 5 initial HM issues.

I propose that the mechanism under which a wholly indivisible number of total reports is distributed be strictly defined (i.e. total_validators = 2, total_reports = 3, total_reports / total_validators = 3 / 2 = 1.5).
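For instance, one simple scheme (purely illustrative, not an existing C4 rule) would be to round-robin the remainder so that assignments differ by at most one:

```python
# Hypothetical distribution of an indivisible report count across validators:
# hand the remainder out one at a time so counts differ by at most one.
def assign_counts(total_reports: int, total_validators: int) -> list[int]:
    base, remainder = divmod(total_reports, total_validators)
    return [base + (1 if i < remainder else 0) for i in range(total_validators)]

print(assign_counts(3, 2))   # [2, 1]
print(assign_counts(10, 3))  # [4, 3, 3]
```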

Penalties for false negatives

The accuracy penalty is meant to affect a Validator's eligibility to remain one and is a penalty after the fact. As such, an ill-intended Validator can refrain from applying as a Validator until a high-reward contest shows up that they can manipulate / influence. I understand that I am thinking with a malicious mindset, however, C4 has demonstrably suffered manipulation in the past when the stakes are high (i.e. zkSync Era).

C4 should be taking steps to minimize the likelihood of a contest being influenced by a single party, and the current Validator approach is insufficient. I propose that a Validator breaching their accuracy threshold immediately places the contest under review, causing all "tainted" issues to be re-assessed by the other Validators as well as the Judge.

Updated mechanism for selecting the top 3 QA reports

This is an incredibly well-thought-out feature, and I am in full support of this approach. It will result in a reduction of PJQA escalations for reports, and provide better value to the Sponsor and the ultimate report generated by C4.

  • Each validator should ~~nominate the top 3 of their own reviewed QA reports and forward them~~ forward all satisfactory QA reports to the findings repo for judging.

Noting here for transparency that I have edited this guideline; my initial interpretation was incorrect, and the "nominate top 3" was an artifact of changing the approach to selecting the top 3 reports, and not intended to affect the validation process for QA reports.

Thanks so much, @alex-ppg. Couple quick bullets:

  • Committing to ship is basically intended to make this a high priority RFC by way of Cunningham’s Law: the fastest way to the right answer is to post a wrong (or incomplete) answer. (Note that’s not the cleanest!)
  • Based on that, we’ve had some great input and already iterated quickly.
  • Most of the concerns are about behavior and humans are extremely nondeterministic. There are a wide range of incentives and accountabilities which will shape behavior. And we already have rules which cover bad / abusive behavior.
  • I actually expect that the most problematic aspects of this design will not even be identified until we see it in action.
  • I believe the most important areas of need will be based on further clarification needed on existing rules which become more consequential—QA scoring rubric being at the top of that list, as you allude to.

On the specific suggestions here:

I propose the mechanism under which wholly indivisible total reports are distributed to be strictly defined

Seems reasonable. We will come to an answer on this before the first validation session occurs.

As there is no upside for the judge in any case, I propose preventing judges from acting as validators altogether.

I think @CloudEllie and I addressed this somewhere -- our take is judges should be able to validate but not get paid for enhancements. There's a lot of overlapping work done by validators and judges, just like lookouts and judges, and it makes sense to let someone perform both roles, it just doesn't make sense for the judge to be able to enhance.

I propose that a Validator breaching their accuracy threshold immediately places the contest under review, causing all "tainted" issues to be re-assessed by the other Validators as well as the Judge.

I think I agree with this, but can you explain what you're thinking in a little more detail? I'm not sure I understand exactly what you're thinking when you say "breaching accuracy threshold"

the rules of grading QA / Gas Reports is entirely up to the discretion of each judge. As such, having such subjective evaluations impact the accuracy of a Validator is unfair unless clear QA / Gas Report guidelines are set in place.

I think a better/clearer rubric will emerge here. But also, these are going to be subjective, period.

With regard to most of the remaining concerns, my view is that the first judge who handles these cases will begin to set precedent, we'll iterate that via case law, and ultimately come to a standard.

Hey @sockdrawermoney, really appreciate the attention C4 gives to our feedback! I am in alignment with most items stated in your response, and will address some of the things that require further clarification below:

I think @CloudEllie and I addressed this somewhere -- our take is judges should be able to validate but not get paid for enhancements. There's a lot of overlapping work done by validators and judges, just like lookouts and judges, and it makes sense to let someone perform both roles, it just doesn't make sense for the judge to be able to enhance.

I would advise analyzing whether there has been any contest whatsoever where the judge participated as a warden and judged it on top. There is an inherent bias when you have created/improved a finding, as you already consider it valid when doing so. I do not believe this is a huge blocker, but I believe there is no precedent (i.e. the same person being a warden and a judge) to expect the same guidelines to work for the validator role. I am happy to be proved wrong, however!

I think I agree with this, but can you explain what you're thinking in a little more detail? I'm not sure I understand exactly what you're thinking when you say "breaching accuracy threshold"

A Validator will have to meet some eligibility criteria when applying/being chosen to be a validator for a particular contest, including accuracy. If those criteria are no longer met as a result of the Validator's work in an active contest, all the findings they have touched should undergo a manual re-evaluation to ensure the Validator's inaccuracy has not compromised the contest's integrity. This re-evaluation should encompass all findings including those discarded at the validator repository.

The above will act as a deterrent for "fresh" validators with no history whereby many mistakes will cause their percentage threshold to be "breached" and will effectively prevent a scenario whereby a validator waits until a high-reward contest comes up to apply and compromise it.

I am in no way questioning the integrity of existing contest staff (i.e. judges, presorters, validators), and am simply trying to come up with a contingency scenario in case the low likelihood event of a nefarious party participating in the contest process presents itself. I also believe this is a strong indicator of "combat readiness" to external observers who are unaware of C4's inner workings and strict criteria for becoming contest staff.

I would advise analyzing whether there has been any contest whatsoever where the judge participated as a warden and judged it on top. There is an inherent bias when you have created/improved a finding, as you already consider it valid when doing so. I do not believe this is a huge blocker, but I believe there is no precedent (i.e. the same person being a warden and a judge) to expect the same guidelines to work for the validator role. I am happy to be proved wrong, however!

@alex-ppg It has indeed happened, but it's been a very long time since it did. I don't have precise examples at my fingertips, but some of our more long-tenured judges such as @GalloDaSballo, and perhaps @0xean, have been around long enough to recall a time when we had fewer judges available and occasionally needed people to do double duty. It's certainly not ideal, but we have always endeavoured to balance pragmatism with idealism.

I propose that a Validator breaching their accuracy threshold immediately places the contest under review, causing all "tainted" issues to be re-assessed by the other Validators as well as the Judge.

A Validator will have to meet some eligibility criteria when applying/being chosen to be a validator for a particular contest, including accuracy. If those criteria are no longer met as a result of the Validator's work in an active contest, all the findings they have touched should undergo a manual re-evaluation to ensure the Validator's inaccuracy has not compromised the contest's integrity. This re-evaluation should encompass all findings including those discarded at the validator repository.

Let me parse this in my own words and think through how we might make this work:

  1. Validators' accuracy ratings should be monitored regularly
  2. Whenever a Validator's accuracy rating falls below the acceptable threshold, their Validator role should be immediately suspended and
    1. any work in progress paused
    2. any active validation progress "reset", i.e. other Validators would be asked to re-validate all findings previously assigned to the disqualified Validator.

Does that match what you're proposing?

Note that I'm not certain that we can commit to this immediately, but I do want to understand and explore it, as I can clearly see the value. What I can certainly imagine in the immediate term is handling the Validator RSVPs for high-reward audits as special cases, where staff would hand-select a group of Validators from the qualified pool based on accuracy stats.

Thanks for contributing further @CloudEllie, once again I appreciate the focus given to the matter! I hope more judges contribute to this thread as well and refute my statements if need be.

It has indeed happened, but it's been a very long time since it did.

I see; in that case, I retract my comment. The present guidelines are adequate and would require change only if an incident occurs.

Does that match what you're proposing?

It does so precisely, and handling high-reward audit Validator RSVPs manually is also a good countermeasure. I believe I have gotten my point across as I wanted to, and the responsibility for iterating on my feedback now lies in C4's hands.

I do not have any further concerns to raise at this moment and will revisit this thread if need be after the feature goes live and I have personally experienced it as either a Validator or a Judge.

Hugely appreciate you sharpening our thinking here, @alex-ppg 🙏

@alex-ppg you're a gem, as always. thank you!

I think if a validator decides to enhance a report, there should be a label in the validation repo marking the report as enhanced.

It would also be good to link the enhanced report to the original report, for the judge and other security researchers.

Otherwise, the auditor may have to go through each judged submission and go back and forth between the validation repo and the findings repo to check whether their report was enhanced and how much enhancement was made.

I think if a validator decides to enhance a report, there should be a label in the validation repo marking the report as enhanced.

@JeffCX All enhanced submissions will have the improved label. We're working on a way to link improved submissions back to their originals.

Citing my conversation with @CloudEllie from the SR channel here.

Issue

The changes to the original guidelines in this org issue above state that "Validators may not improve duplicates of submissions already present in the findings repo". The term findings repo here is not clear as to whether it includes the validation repo as well.

This is a problem since if the validation repo is not considered, we could have scenarios where proof from issue A in the validation repo could be used to improve issue B in the validation repo (assuming both issues share the same root cause).

Solution

Validators should be disallowed from improving submissions with duplicates in the validation repo (unless all issues in the group are missing something important and can be collectively improved).