squat/modulus

Nvidia drivers failing to compile with CL 1520.6.0

snsumner opened this issue · 5 comments

I have been testing Modulus with Nvidia drivers so I can use GPU scheduling on Tectonic. It was working just fine until CLUO upgraded my cluster to 1520.6.0. Now Modulus fails to compile the Nvidia drivers.

David Michael said it's because the build-1520 branch has a kernel that is not in 1520.6.0, since it will go into 1520.7.0. The Git branch has changes that have not been built into a release yet. https://github.com/squat/modulus/blob/master/nvidia/compile#L8

Euan Kemp believes you are doing the wrong thing here and suggested I submit an issue. Given it's already in the dev container, I think it should be pretty easy to get the right version. The repo should be handy and everything.

David Michael suggested I make the script find the release commit in the manifest and check out that commit instead of the branch: https://github.com/coreos/manifest/blob/v1520.6.0/release.xml#L23

I'm not a developer, so I'm struggling to figure out how to work around this issue. I would greatly appreciate it if you could fix your code so it works with the latest CL version.

squat commented

@snsumner thanks for submitting this. I've noticed this issue for several months and builds have been able to run nonetheless, so I am not quite sure how this suddenly became a breaking issue. From your description I understand that:

  1. the compile script is checking out the release branch, but this is not correct; instead
  2. the compile script should checkout the git ref of the release, which can be found in the release manifest.
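As a rough sketch of step 2: assuming release.xml follows the usual repo manifest layout, in which each `project` element carries a `revision` attribute (I have not checked this against the actual file), the lookup could look something like the following. The project names and revision values in the sample are placeholders, and `pinned_revision` is a hypothetical helper, not something that exists in the compile script.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of a repo manifest. The project names and
# revisions are placeholders, not copied from the real release.xml.
SAMPLE_MANIFEST = """<manifest>
  <project name="example/overlay"
           path="src/overlay"
           revision="0123456789abcdef0123456789abcdef01234567"
           upstream="refs/heads/build-1520"/>
  <project name="example/other"
           path="src/other"
           revision="89abcdef0123456789abcdef0123456789abcdef"
           upstream="refs/heads/build-1520"/>
</manifest>
"""


def pinned_revision(manifest_xml: str, project_name: str) -> str:
    """Return the revision pinned for project_name in the manifest text."""
    root = ET.fromstring(manifest_xml)
    for project in root.iter("project"):
        if project.get("name") == project_name:
            revision = project.get("revision")
            if revision is None:
                raise ValueError(f"{project_name} has no pinned revision")
            return revision
    raise KeyError(f"{project_name} not found in manifest")


if __name__ == "__main__":
    # Check out the exact commit of the release instead of the branch head.
    rev = pinned_revision(SAMPLE_MANIFEST, "example/overlay")
    print(f"git checkout {rev}")
```

The idea is that the branch head (build-1520) can move ahead of the release, while the revision recorded in the manifest for a given tag stays fixed.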

Am I reading that right? I'll start working on getting this working.

Thanks!

Yes, that's my understanding according to Euan and Michael, but I'm not a coder so I don't necessarily understand what they are talking about. Let me know once you have a fix and I'll test it out in my lab environment.

squat commented

@snsumner I applied the suggested fix locally and compared the build logs from before and after and found the same build error:

# emerge -gKq --jobs 4 --load-average 4 coreos-sources


!!! Error fetching binhost package info from 'http://builds.release.core-os.net/embargoed/devfiles/boards/amd64-usr/1548.2.0/pkgs/'
!!! HTTP Error 403: Forbidden



!!! Error fetching binhost package info from 'http://builds.release.core-os.net/embargoed/devfiles/boards/amd64-usr/1548.2.0/toolchain/'
!!! HTTP Error 403: Forbidden

Unable to unshare: EPERM
Unable to unshare: EPERM
Unable to unshare: EPERM

emerge: there are no binary packages to satisfy "coreos-sources".

emerge: searching for similar names...
emerge: Maybe you meant any of these: sys-kernel/coreos-modules, coreos-base/coreos-dev, coreos-base/coreos-au-key?

Since this did not work, I adjusted the compile script to match the instructions documented upstream in coreos/docs@6620632 and pushed this to master. However, I continue to see the same error.

@snsumner can you please show me the logs from the failed kernel module builds? If you see the same failures then I suspect the issue is a different one, or that the upstream docs may have an error as well.

squat commented

@snsumner it looks like builds on beta and alpha are once again working. I am currently testing stable as well. Also, now that coreos/dev-util#22 has merged, we should not need to manually set the revision of coreos-overlay or portage-stable, e.g. kubernetes-retired/kube-aws#985.

squat commented

@snsumner, stable, beta, and alpha are all working so I am closing this issue for now. Please re-open if you continue to see these issues.