# [BUG]: devcontainers failing to compile program
### Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
### Type of Bug
Compile-time Error
### Component
Not sure
### Describe the bug
The devcontainers currently rely on sccache. People who are not NVIDIA employees cannot use the cloud cache, so sccache should fall back to a local cache directory (e.g. under the home directory). With the current build scripts, however, sccache tries to connect to the remote server, fails, and compilation stops.
Here's a modification to the devcontainer.json that fixes the problem for me. It isn't a permanent solution, just something I did on my end to get set up right away:
"shutdownAction": "stopContainer",
"image": "rapidsai/devcontainers:23.12-cpp-gcc12-cuda12.2-ubuntu22.04",
"hostRequirements": {
"gpu": true
},
"initializeCommand": [
"/bin/bash",
"-c",
"mkdir -m 0755 -p ${localWorkspaceFolder}/.{cache,config,sccache}"
],
"containerEnv": {
"SCCACHE_DIR": "${containerWorkspaceFolder}/.sccache",
"HISTFILE": "${containerWorkspaceFolder}/.cache/._bash_history",
"DEVCONTAINER_NAME": "cuda12.2-gcc12",
"CCCL_CUDA_VERSION": "12.2",
"CCCL_HOST_COMPILER": "gcc",
"CCCL_HOST_COMPILER_VERSION": "12",
"CCCL_BUILD_INFIX": "cuda12.2-gcc12",
"CXX": "g++",
"CUDAHOSTCXX": "g++"
},
"workspaceFolder": "/home/coder/${localWorkspaceFolderBasename}",
"workspaceMount": "source=${localWorkspaceFolder},target=/home/coder/${localWorkspaceFolderBasename},type=bind,consistency=consistent",
"mounts": [
"source=${localWorkspaceFolder}/.cache,target=/home/coder/.cache,type=bind,consistency=consistent",
"source=${localWorkspaceFolder}/.config,target=/home/coder/.config,type=bind,consistency=consistent"
],
"customizations": {
"vscode": {
"extensions": [
"llvm-vs-code-extensions.vscode-clangd",
"xaver.clang-format"
],
"settings": {
"editor.defaultFormatter": "xaver.clang-format",
"clang-format.executable": "/usr/local/bin/clang-format",
"clangd.arguments": [
"--compile-commands-dir=${workspaceFolder}"
]
}
}
},
"name": "cuda12.2-gcc12"
}
As you can see, this takes the cloud out of the picture and simply configures sccache to use my local machine.
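For reference, here is a rough sketch of an alternative workaround that leaves devcontainer.json untouched and instead points sccache at a local disk cache via its config file. The config path, cache directory, and cache size below are assumptions on my part, not something from the original setup:

```bash
# Sketch: make sccache use a local disk cache instead of the S3 bucket.
# Assumes sccache reads its user config from ~/.config/sccache/config.
mkdir -p ~/.config/sccache ~/.sccache
cat > ~/.config/sccache/config <<'EOF'
[cache.disk]
dir = "/home/coder/.sccache"
size = 10737418240   # in bytes (~10 GiB); pick whatever fits your disk
EOF

# The S3-related env vars still take precedence, so clear them and restart
# the server (same idea as the workaround suggested later in this thread).
unset SCCACHE_BUCKET SCCACHE_REGION
sccache --stop-server || true
sccache --start-server
```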
### How to Reproduce
- Try to compile with a dev container while not connected to any of NVIDIA's access packages (i.e., without credentials for the S3 cache bucket); see the sketch below.
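For concreteness, any CMake configure step that routes the compiler through sccache triggers the failure, since the compiler sanity check already goes through the launcher. A minimal sketch (the repo's own build scripts normally set this up, so the exact invocation below is only illustrative):

```bash
# Any configure step that routes the compiler through sccache reproduces it:
# the compiler sanity check times out trying to reach the S3 bucket.
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER_LAUNCHER=sccache ..   # fails in CMakeTestCXXCompiler
```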
### Expected behavior
```
CMake Error at /usr/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
The C++ compiler
"/usr/bin/gcc"
is not able to compile a simple test program.
It fails with the following output:
Change Dir: '/home/coder/cccl/build/cuda12.2-llvm16/cub-cpp17/CMakeFiles/CMakeScratch/TryCompile-Us9ZRD'
Run Build Command(s): /usr/local/bin/ninja -v cmTC_5f59e
[1/2] /usr/bin/sccache /usr/bin/gcc -o CMakeFiles/cmTC_5f59e.dir/testCXXCompiler.cxx.o -c /home/coder/cccl/build/cuda12.2-llvm16/cub-cpp17/CMakeFiles/CMakeScratch/TryCompile-Us9ZRD/testCXXCompiler.cxx
FAILED: CMakeFiles/cmTC_5f59e.dir/testCXXCompiler.cxx.o
/usr/bin/sccache /usr/bin/gcc -o CMakeFiles/cmTC_5f59e.dir/testCXXCompiler.cxx.o -c /home/coder/cccl/build/cuda12.2-llvm16/cub-cpp17/CMakeFiles/CMakeScratch/TryCompile-Us9ZRD/testCXXCompiler.cxx
sccache: error: Timed out waiting for server startup. Maybe the remote service is unreachable?
Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
ninja: build stopped: subcommand failed.
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
CMakeLists.txt:17 (project)
```

And when you run with `SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1`, the following output appears:

```
"/home/coder/.config/sccache/config"
[2023-10-22T16:12:45Z DEBUG sccache::config] Couldn't open config file: failed to open file `/home/coder/.config/sccache/config`
[2023-10-22T16:12:45Z DEBUG sccache::config] Attempting to read config file at "/home/coder/.config/sccache/config"
[2023-10-22T16:12:45Z DEBUG sccache::config] Couldn't open config file: failed to open file `/home/coder/.config/sccache/config`
[2023-10-22T16:12:45Z INFO sccache::server] start_server: port: 4226
[2023-10-22T16:12:45Z INFO sccache::server] No scheduler address configured, disabling distributed sccache
[2023-10-22T16:12:45Z DEBUG sccache::cache::cache] Init s3 cache with bucket rapids-sccache-devs, endpoint None
[2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend build started: Builder { root: None, bucket: "rapids-sccache-devs", endpoint: None, region: Some("us-east-2"), .. }
[2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use root /
[2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use bucket rapids-sccache-devs
[2023-10-22T16:12:45Z DEBUG reqsign::aws::config] load_via_profile_config_file failed: No such file or directory (os error 2)
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
[2023-10-22T16:12:45Z DEBUG reqsign::aws::config] load_via_profile_shared_credentials_file failed: No such file or directory (os error 2)
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
[2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use region: us-east-2
[2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use endpoint: https://s3.us-east-2.amazonaws.com/rapids-sccache-devs
[2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend build finished
[2023-10-22T16:12:45Z DEBUG opendal::services] service=s3 operation=metadata -> started
[2023-10-22T16:12:45Z DEBUG opendal::services] service=s3 operation=metadata -> finished: AccessorInfo { scheme: S3, root: "/", name: "rapids-sccache-devs", capability: { Stat | Read | Write | CreateDir | Delete | Copy | List | Presign | Batch } }
[2023-10-22T16:12:45Z DEBUG opendal::services] service=s3 operation=read path=.sccache_check range=0- -> started
[2023-10-22T16:12:45Z DEBUG reqwest::connect] starting new connection: http://169.254.169.254/
[2023-10-22T16:12:45Z DEBUG hyper::client::connect::http] connecting to 169.254.169.254:80
sccache: error: Timed out waiting for server startup. Maybe the remote service is unreachable?
Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
[2023-10-22T16:14:58Z DEBUG reqsign::aws::credential] load credential via imds_v2 failed: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: Operation timed out (os error 110)
Caused by:
0: error trying to connect: tcp connect error: Operation timed out (os error 110)
1: tcp connect error: Operation timed out (os error 110)
2: Operation timed out (os error 110)
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
8: <unknown>
9: <unknown>
10: <unknown>
11: <unknown>
12: <unknown>
13: <unknown>
14: <unknown>
15: <unknown>
16: <unknown>
17: <unknown>
18: <unknown>
19: <unknown>
20: <unknown>
21: <unknown>
22: <unknown>
23: <unknown>
[2023-10-22T16:14:58Z WARN opendal::services] service=s3 operation=read path=.sccache_check range=0- -> errored: PermissionDenied (temporary) at read => no valid credential found, please check configuration or try again
Context:
service: s3
path: .sccache_check
range: 0-
[2023-10-22T16:14:58Z ERROR sccache::server] storage check failed for: cache storage failed to read: PermissionDenied (temporary) at read => no valid credential found, please check configuration or try again
Context:
service: s3
path: .sccache_check
range: 0-
Stack backtrace:
0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
[2023-10-22T16:14:58Z DEBUG sccache::server] notify_server_startup(Err { reason: "cache storage failed to read: PermissionDenied (temporary) at read => no valid credential found, please check configuration or try again\n\nContext:\n service: s3\n path: .sccache_check\n range: 0-\n" })
sccache: error: No such file or directory (os error 2)
ninja: build stopped: subcommand failed.
```
### Reproduction link
_No response_
### Operating System
all dev containers
### nvidia-smi output
_No response_
### NVCC version
_No response_
@ZelboK in the meantime, an easier way for you to work around the problem is to run these commands inside the container:
```bash
unset SCCACHE_BUCKET
unset SCCACHE_REGION
sccache --stop-server
sccache --start-server
```
Basically, unset those envvars and then restart the sccache server. This will automatically switch to using your local file system for the cache.
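A quick way to confirm the fallback took effect (a sketch; the exact `--show-stats` wording can vary between sccache versions):

```bash
# After the restart, the stats should report a local disk cache rather than S3.
sccache --show-stats | grep -i 'cache location'
# Expected (roughly): Cache location   Local disk: "/home/coder/.sccache"
```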
Hey @ZelboK, the devcontainer startup scripts should automatically detect that you don't have read/write access to the bucket and adjust the sccache configuration accordingly. We have tests for some of these scenarios, so it's difficult to say what's causing your problem specifically.
Can you give us step-by-step instructions on how to reproduce the situation?
- Are you authenticating with GitHub (even though your user won't provide S3 access)?
- Are the startup scripts completing successfully?
- What are the values of `SCCACHE_BUCKET`, `SCCACHE_REGION`, and `SCCACHE_S3_NO_CREDENTIALS` after the startup scripts have finished running? (A snippet for printing them is sketched below.)
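If it helps, here is a sketch of how to dump those values from inside the container once the startup scripts have finished (the `<unset>` fallbacks are just for readability):

```bash
# Print the sccache-related environment the startup scripts left behind.
env | grep -E '^SCCACHE_' || echo "no SCCACHE_* variables set"
echo "SCCACHE_BUCKET=${SCCACHE_BUCKET:-<unset>}"
echo "SCCACHE_REGION=${SCCACHE_REGION:-<unset>}"
echo "SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS:-<unset>}"
```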
Hi,
I've never actually used devcontainers before contributing, so I apologize for any misunderstandings on my end. However, I do recall authenticating with GitHub at some point; I didn't think much of it because I just wanted to get started. From what you've described, it sounds like that could be the culprit.
I will get back to this issue later today, after I finish work, with more details. I figured the GitHub authentication part was worth mentioning right away, though.
@cwharris I have the same issue
Repro steps:
- git clone https://github.com/nvidia/cccl.git
- Open folder in VSCode, press "Reopen in Container" right away
- devcontainer setup in VSCode terminal - https://pastebin.com/mhLDhMDC
echo "$SCCACHE_BUCKET $SCCACHE_REGION $SCCACHE_S3_NO_CREDENTIALS"
rapids-sccache-devs us-east-2- build - https://pastebin.com/DqiRYkEX
- after unsetting env vars and restarting sccache server as @jrhemstad suggested, compilation is successful
We've pushed a release that contains a fix to at least one of the sccache env variable issues. Can you retry and let us know if the issue is resolved for you?
> We've pushed a release that contains a fix to at least one of the sccache env variable issues. Can you retry and let us know if the issue is resolved for you?
It works for me now! CUDA 12.3 with GCC 12 is what I tried.