open-quantum-safe/liboqs

Trigger downstream liboqs-python CI is failing

dstebila opened this issue · 25 comments

Describe the bug
In recent liboqs CI builds on CircleCI, the "Trigger liboqs-python CI" step is failing.

To Reproduce
See https://app.circleci.com/pipelines/github/open-quantum-safe/liboqs/3710/workflows/34731b55-1e34-4510-bb20-bfdd484fa5d6/jobs/29103

I'm guessing it's somehow related to the changes involving oqs-bot and things not being configured correctly in https://github.com/open-quantum-safe/liboqs/blob/main/.circleci/config.yml#L264.

@ryjones Do you have any ideas about this?

Possibly it would be easier if we switched this job (to trigger downstream CI) over to GitHub Actions...?

The issue is that enterprise accounts don't allow personal access tokens (PATs) to work like they used to. You have to create a GitHub app with a webhook. I'm looking into how to get this done.

Thanks Ry!

Also, if circle-ci doesn't offer anything over GitHub actions, it would make life easier if you moved it over.

Agreed: We have a long-standing issue on this that no-one found time to work on (particularly, getting us ARM runners that were the sole reason why we didn't move off CCI): open-quantum-safe/oqs-provider#248 (oqs-provider typically leads liboqs in infrastructure updates which is why I create such issues first in that sub project as a "proving ground"). If you'd have time to work on this, we'd surely be happy. In that case, please also take a look at #1780 and all dependents.

OQS doesn't (yet) have access to ARM64. I don't have authorization to spend money on large runners, so I will need to huddle with Naomi and Hart to figure out what is authorized for this.

I don't have authorization to spend money

Thanks for the clear statement of limitation, @ryjones.

@dstebila Please help to have the PQCA-powers-that-be authorize this before the 0.11.0 release (created open-quantum-safe/tsc#25 to track): The missing authorization stops the project from consolidating on GitHub Actions (as recommended by an LF employee and long desired by OQS for efficiency) and otherwise requires unnecessary work:

In order to deliver the 0.11.0 milestone, #1780 will need to support ARM64 CI as per PLATFORMS.md. Given the missing authorization above, the only way to facilitate that seems to be again investing in bespoke ARM64 CCI code.

Unpaid volunteers could consider it unfair or unsavory to do such inefficient or "throw-away work" to save money for an alliance funded by multi-billion-dollar-profit companies.

I personally found it OK to do such "work-around code" while OQS was a pure research project carried by voluntary contributors, but am unwilling to put in such effort to retain a mirage of a well-funded professional alliance, particularly as I'm personally annoyed seeing LF/PQCA processes forced onto OQS without any immediately visible offsetting benefits such as suitable CI funding authorizations: I'd really be happy if PQCA were willing to spend a healthy portion of its funding on supporting development and not most of it on lawyers, marketing and executive travel.

FWIW, I did complete open-quantum-safe/ci-containers#84 to lay the foundation for hitting the 0.11.0 goal but for the reasons above will not write further CCI code going forward (beyond the one in the PR above to test the Dockerfile).

To be clear, access to the ARM64 runners is blocked by two things: money and approval from GitHub. I will push on the GitHub angle.

My understanding is that with our current PQCA structure we could raise the request for funding of ARM runners at the PQCA TAC, and they could then potentially raise a request with the governing board for funding?

I don't know exactly what the scope is here, but there should be some budget? For our projects, access to supported arm64 runners would seem to be very beneficial in reducing workload, and I wouldn't imagine the usage is too intense.

Is it worth figuring out how much resource we think we might need so that we could provide some kind of cost estimate based on GitHub's published figures?
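As a rough illustration of such an estimate, here is a minimal sketch. The function name and all input numbers are placeholders chosen for illustration, not measured OQS usage or GitHub's actual per-minute rates; those would need to be taken from CI logs and GitHub's published pricing.

```python
from decimal import Decimal


def estimate_monthly_cost(minutes_per_run, runs_per_month, rate_per_minute):
    """Estimate monthly runner cost in dollars.

    All three inputs are assumptions to be filled in from real data:
    average job duration, number of CI runs, and the published
    per-minute price for the runner type in question.
    """
    # Convert via str() so that values like 0.01 are represented exactly.
    total_minutes = Decimal(str(minutes_per_run)) * Decimal(str(runs_per_month))
    cost = total_minutes * Decimal(str(rate_per_minute))
    # Round to whole cents.
    return cost.quantize(Decimal("0.01"))


# Placeholder figures only: 10-minute jobs, 100 runs a month, $0.01 per minute.
print(estimate_monthly_cost(10, 100, "0.01"))
```

Comparing the result for a native ARM64 runner against a free x64 runner running the same job under emulation would also need the emulated job's (much longer) duration as input.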

Given we have arm code in pq-code-package too it could be useful there (currently using QEMU) - I can float the idea there.

Is using a regular runner with QEMU a viable fallback? (It can be very slow...)

We also have the goal to not destroy earth's resources uselessly. Using QEMU is a clear case of that: Why run CPUs for hours if you can do the same thing in seconds on "proper" CPUs?

For the purposes of showing that all would work on GH, I already implemented this as "proof of concept", e.g., see test run in action here -- but with a very bad ecological conscience as per the above.

For our projects, access to supported arm64 runners would seem to be very beneficial in reducing workload, and I wouldn't imagine the usage is too intense.

Completely agree. Should be a no-brainer. (The promise for) Getting this (access to such resources) was also one of the reasons why I withdrew my objections to the LF take-over of OQS.

I have requested that pqcp and oqs get access to the ARM runners. The issue is they enter public beta in a few weeks, so they have been slow to approve new access requests.
Here is a copy of the request I raised yesterday.

Please add two orgs to the beta; please add three users to support them

Please add these orgs of which I am an owner:
https://github.com/pq-code-package
https://github.com/open-quantum-safe

Please add these users to the beta org:
baentsch
bhess
SWilson4
planetf1

The same issue appears when triggering oqs-provider downstream tests (using Github Actions):
https://github.com/open-quantum-safe/liboqs/actions/runs/9076079554/job/24938031071

@ryjones thanks for requesting access again. I had assumed there would still be fees for using the ARM runners once public. Maybe that concern is misplaced and some usage will be supported on the free tier. Do we know any more yet?

some usage will be supported on the free tier

As I wrote above, "some usage" may already be working for non-commercial projects. It just takes ages to complete: 10 minutes for x64 and 100 minutes for aarch64, as per the log I referenced. Possibly the run used the QEMU setup I added to be safe in case the ARM64 runners do not, well, run. But conceptually, the "test GH job" I created for that purpose should use real hardware (unless I did something really wrong; please check).

In a stroke of good fortune, the PQCA board call is right after the PQCA TAC call next week. Given the data @baentsch has provided, I should be able to have a reasonable request to make.

For example, at Hyperledger, we spend about $2000 a month (more or less) on GitHub large runners, including arm. I imagine PQCA as a whole will be less than that for at least a year or two.

Having looked at all available CircleCI data, OQS would have spent ~$82 since June of 2023 on ARM64 runners, had they been available. All of the other usage seems to fall in the free tier for GitHub.

Thanks for this assessment @ryjones -- but please note that OQS has been skipping constant time testing on ARM64. This is a very debatable limitation that IMO should be improved on given ARM64 is now a formally supported tier 1 platform and --unlike Hyperledger-- OQS conceptually is a security software library that should have such (time-intensive) testing, particularly as/if people should begin to trust it in real world applications also on that platform. In addition, OQS is currently not doing a lot of other time-intensive testing that it should (fuzzing, etc.).

All told, I hope you can put (substantially) more than $82 into your annual budget for this: It would save (at least myself) quite a bit of effort to continue to work around this limitation. Also please do not (have LF/PQCA) consider offsetting my work at 0-cost given I am "0-cost"/a volunteer....

I plan to ask for $2000 a month, to cover workload expansion. With the exception of the ARM64 jobs, I think GitHub's current free runners should be able to do substantially all of the CI work; you could move them over at your leisure.

Even if we don't get into the beta, one option would be to sign up for BuildJet, which Hyperledger used for a while.

My interpretation of the sequence leading up to the CI failure (github): (cc: @ryjones )

The test that fails is triggered by

oqs-provider-release-test:
(well, in main).

this then seems to generate an event on the liboqs repo

https://github.com/open-quantum-safe/liboqs/blob/a5ec23cf19763d36a558b8358345823ae45d57e5/scripts/provider-test-trigger.sh

This is a manual ‘dispatches’ event, but against the oqs-provider repo — so it’s effectively triggering tests there

The workflow https://github.com/search?q=repo%3Aopen-quantum-safe%2Foqs-provider%20liboqs-release&type=code is then run

which then runs the tests in https://github.com/open-quantum-safe/oqs-provider/blob/main/scripts/release-test-ci.sh

@planetf1 Apologies if I'm misinterpreting what you wrote, but just to clarify: the downstream tests are not failing. The failures are due to permissions issues with the token that we use to trigger the downstream tests. Even if the downstream tests were failing, it would not cause the upstream workflows to "go red": the upstream workflow checks the GitHub API response code, which only indicates whether the downstream workflow was triggered successfully, not whether it completed successfully.
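To make the distinction concrete, here is a minimal Python sketch of a trigger of this kind. It is not the project's actual script (that is `scripts/provider-test-trigger.sh`, which may work differently); the helper names, the event name, and the token are hypothetical. It assumes GitHub's REST endpoint for repository dispatch events, which answers `204 No Content` when the event was accepted and says nothing about how the downstream run ends.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner, repo, event_type, token, payload=None):
    """Build (but do not send) a repository_dispatch request.

    event_type and payload are illustrative; the real trigger script
    may use different values.
    """
    body = {"event_type": event_type}
    if payload is not None:
        body["client_payload"] = payload
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def interpret_dispatch_status(status):
    """Classify the HTTP status returned by the dispatch call.

    204 only means the downstream workflow was triggered, not that it
    passed. Treating 401/403/404 as auth or permission problems is an
    assumption about typical GitHub API behaviour.
    """
    if status == 204:
        return "triggered"
    if status in (401, 403, 404):
        return "token rejected or lacks permission"
    return "unexpected response"


# Hypothetical event name and dummy token, for illustration only.
req = build_dispatch_request(
    "open-quantum-safe", "oqs-provider", "example-event", "dummy-token"
)
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` for 4xx responses; its `code` attribute can be passed to `interpret_dispatch_status`. A token that has lost its permissions would then make the upstream job fail at the trigger step, regardless of the state of the downstream tests.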

The infrastructure that's currently failing is mostly my work (#1507, open-quantum-safe/liboqs-python#65, open-quantum-safe/oqs-provider#345, #1654). My understanding is that it broke when the OQS GitHub account was upgraded to "Enterprise", which changed what we can and can't do with personal access tokens. @ryjones Please let me know if there's anything I can do (within the permissions I have) to help with getting this to work again. I think I have a pretty good understanding of the moving parts involved with the different workflows.

@bhess @dstebila Would it be OK if I forked the two repos within the OQS org so I can test out some actions? They would have different names and be deleted after I'm done with them.

Go for it!

The CI failures were occurring because oqs-bot didn't have sufficient permissions. (I'm guessing its permissions were lowered silently during the move to Enterprise or some other recent change.)

After https://github.com/open-quantum-safe/tsc/pull/30/files, liboqs main CI is green and the oqs-provider release test trigger works.