NVIDIA/cccl

[BUG]: devcontainers failing to compile program.

Opened this issue · 8 comments

ZelboK commented

Is this a duplicate?

Type of Bug

Compile-time Error

Component

Not sure

Describe the bug

The devcontainers currently rely on sccache. People who are not NVIDIA employees can't use the cloud cache, so sccache should fall back to a local cache in the home directory. However, with the current build scripts, sccache tries to connect to the remote server, fails, and compilation stops.
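For reference, here is a minimal sketch of what a local-only fallback could look like from inside the container (just an illustration; the cache path and size are arbitrary, and the SCCACHE_* variables below are the standard sccache ones, not anything cccl-specific):

export SCCACHE_DIR="$HOME/.cache/sccache"   # store cached objects on the local disk
export SCCACHE_CACHE_SIZE="10G"             # optional cap on the local cache size
unset SCCACHE_BUCKET SCCACHE_REGION         # make sure no S3 backend is configured
sccache --stop-server 2>/dev/null || true   # stop any server started with the old settings
sccache --start-server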

Here's a modification I made to devcontainer.json that fixes the problem. It isn't a permanent solution; it's just something I did on my end because I wanted to get set up right away.

  "shutdownAction": "stopContainer",
  "image": "rapidsai/devcontainers:23.12-cpp-gcc12-cuda12.2-ubuntu22.04",
  "hostRequirements": {
    "gpu": true
  },
  "initializeCommand": [
    "/bin/bash",
    "-c",
    "mkdir -m 0755 -p ${localWorkspaceFolder}/.{cache,config,sccache}"
  ],
  "containerEnv": {
    "SCCACHE_DIR": "${containerWorkspaceFolder}/.sccache",
    "HISTFILE": "${containerWorkspaceFolder}/.cache/._bash_history",
    "DEVCONTAINER_NAME": "cuda12.2-gcc12",
    "CCCL_CUDA_VERSION": "12.2",
    "CCCL_HOST_COMPILER": "gcc",
    "CCCL_HOST_COMPILER_VERSION": "12",
    "CCCL_BUILD_INFIX": "cuda12.2-gcc12",
    "CXX": "g++",
    "CUDAHOSTCXX": "g++"
  },
  "workspaceFolder": "/home/coder/${localWorkspaceFolderBasename}",
  "workspaceMount": "source=${localWorkspaceFolder},target=/home/coder/${localWorkspaceFolderBasename},type=bind,consistency=consistent",
  "mounts": [
    "source=${localWorkspaceFolder}/.cache,target=/home/coder/.cache,type=bind,consistency=consistent",
    "source=${localWorkspaceFolder}/.config,target=/home/coder/.config,type=bind,consistency=consistent"
  ],
  "customizations": {
    "vscode": {
      "extensions": [
        "llvm-vs-code-extensions.vscode-clangd",
        "xaver.clang-format"
      ],
      "settings": {
        "editor.defaultFormatter": "xaver.clang-format",
        "clang-format.executable": "/usr/local/bin/clang-format",
        "clangd.arguments": [
          "--compile-commands-dir=${workspaceFolder}"
        ]
      }
    }
  },
  "name": "cuda12.2-gcc12"
}

As you can see, I'm taking the cloud out of the picture and just configuring sccache to use my local disk.

How to Reproduce

  1. Try to compile with a devcontainer while not connected to any of NVIDIA's access packages (i.e. without credentials for the S3 cache).

Expected behavior

CMake Error at /usr/share/cmake-3.27/Modules/CMakeTestCXXCompiler.cmake:60 (message):
  The C++ compiler

    "/usr/bin/gcc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/home/coder/cccl/build/cuda12.2-llvm16/cub-cpp17/CMakeFiles/CMakeScratch/TryCompile-Us9ZRD'
    
    Run Build Command(s): /usr/local/bin/ninja -v cmTC_5f59e
    [1/2] /usr/bin/sccache /usr/bin/gcc    -o CMakeFiles/cmTC_5f59e.dir/testCXXCompiler.cxx.o -c /home/coder/cccl/build/cuda12.2-llvm16/cub-cpp17/CMakeFiles/CMakeScratch/TryCompile-Us9ZRD/testCXXCompiler.cxx
    FAILED: CMakeFiles/cmTC_5f59e.dir/testCXXCompiler.cxx.o 
    /usr/bin/sccache /usr/bin/gcc    -o CMakeFiles/cmTC_5f59e.dir/testCXXCompiler.cxx.o -c /home/coder/cccl/build/cuda12.2-llvm16/cub-cpp17/CMakeFiles/CMakeScratch/TryCompile-Us9ZRD/testCXXCompiler.cxx
    sccache: error: Timed out waiting for server startup. Maybe the remote service is unreachable?
      Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
    ninja: build stopped: subcommand failed.
    
  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:17 (project)

And when you run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1, the following output appears:

 "/home/coder/.config/sccache/config"
    [2023-10-22T16:12:45Z DEBUG sccache::config] Couldn't open config file: failed to open file `/home/coder/.config/sccache/config`
    [2023-10-22T16:12:45Z DEBUG sccache::config] Attempting to read config file at "/home/coder/.config/sccache/config"
    [2023-10-22T16:12:45Z DEBUG sccache::config] Couldn't open config file: failed to open file `/home/coder/.config/sccache/config`
    [2023-10-22T16:12:45Z INFO  sccache::server] start_server: port: 4226
    [2023-10-22T16:12:45Z INFO  sccache::server] No scheduler address configured, disabling distributed sccache
    [2023-10-22T16:12:45Z DEBUG sccache::cache::cache] Init s3 cache with bucket rapids-sccache-devs, endpoint None
    [2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend build started: Builder { root: None, bucket: "rapids-sccache-devs", endpoint: None, region: Some("us-east-2"), .. }
    [2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use root /
    [2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use bucket rapids-sccache-devs
    [2023-10-22T16:12:45Z DEBUG reqsign::aws::config] load_via_profile_config_file failed: No such file or directory (os error 2)
        
        Stack backtrace:
           0: <unknown>
           1: <unknown>
           2: <unknown>
           3: <unknown>
           4: <unknown>
           5: <unknown>
           6: <unknown>
           7: <unknown>
    [2023-10-22T16:12:45Z DEBUG reqsign::aws::config] load_via_profile_shared_credentials_file failed: No such file or directory (os error 2)
        
        Stack backtrace:
           0: <unknown>
           1: <unknown>
           2: <unknown>
           3: <unknown>
           4: <unknown>
           5: <unknown>
           6: <unknown>
           7: <unknown>
    [2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use region: us-east-2
    [2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend use endpoint: https://s3.us-east-2.amazonaws.com/rapids-sccache-devs
    [2023-10-22T16:12:45Z DEBUG opendal::services::s3::backend] backend build finished
    [2023-10-22T16:12:45Z DEBUG opendal::services] service=s3 operation=metadata -> started
    [2023-10-22T16:12:45Z DEBUG opendal::services] service=s3 operation=metadata -> finished: AccessorInfo { scheme: S3, root: "/", name: "rapids-sccache-devs", capability: { Stat | Read | Write | CreateDir | Delete | Copy | List | Presign | Batch } }
    [2023-10-22T16:12:45Z DEBUG opendal::services] service=s3 operation=read path=.sccache_check range=0- -> started
    [2023-10-22T16:12:45Z DEBUG reqwest::connect] starting new connection: http://169.254.169.254/
    [2023-10-22T16:12:45Z DEBUG hyper::client::connect::http] connecting to 169.254.169.254:80
    sccache: error: Timed out waiting for server startup. Maybe the remote service is unreachable?
    Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
    [2023-10-22T16:14:58Z DEBUG reqsign::aws::credential] load credential via imds_v2 failed: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: Operation timed out (os error 110)
        
        Caused by:
            0: error trying to connect: tcp connect error: Operation timed out (os error 110)
            1: tcp connect error: Operation timed out (os error 110)
            2: Operation timed out (os error 110)
        
        Stack backtrace:
           0: <unknown>
           1: <unknown>
           2: <unknown>
           3: <unknown>
           4: <unknown>
           5: <unknown>
           6: <unknown>
           7: <unknown>
           8: <unknown>
           9: <unknown>
          10: <unknown>
          11: <unknown>
          12: <unknown>
          13: <unknown>
          14: <unknown>
          15: <unknown>
          16: <unknown>
          17: <unknown>
          18: <unknown>
          19: <unknown>
          20: <unknown>
          21: <unknown>
          22: <unknown>
          23: <unknown>
    [2023-10-22T16:14:58Z WARN  opendal::services] service=s3 operation=read path=.sccache_check range=0- -> errored: PermissionDenied (temporary) at read => no valid credential found, please check configuration or try again
        
        Context:
            service: s3
            path: .sccache_check
            range: 0-
        
    [2023-10-22T16:14:58Z ERROR sccache::server] storage check failed for: cache storage failed to read: PermissionDenied (temporary) at read => no valid credential found, please check configuration or try again
        
        Context:
            service: s3
            path: .sccache_check
            range: 0-
        
        
        Stack backtrace:
           0: <unknown>
           1: <unknown>
           2: <unknown>
           3: <unknown>
           4: <unknown>
           5: <unknown>
           6: <unknown>
           7: <unknown>
    [2023-10-22T16:14:58Z DEBUG sccache::server] notify_server_startup(Err { reason: "cache storage failed to read: PermissionDenied (temporary) at read => no valid credential found, please check configuration or try again\n\nContext:\n    service: s3\n    path: .sccache_check\n    range: 0-\n" })
    sccache: error: No such file or directory (os error 2)
    ninja: build stopped: subcommand failed.

### Reproduction link

_No response_

### Operating System

all dev containers

### nvidia-smi output

_No response_

### NVCC version

_No response_

Thanks @ZelboK. I had thought this fallback to the local filesystem would already be working, but I think you're the first external person to try it, so we haven't tested this yet ;)

@trxcllnt and @cwharris will look into it and we'll try to get it fixed asap.

@ZelboK in the meantime, an easier way for you to work around the problem is to run these commands inside the container:

unset SCCACHE_BUCKET
unset SCCACHE_REGION
sccache --stop-server
sccache --start-server

Basically, unset those envvars and then restart the sccache server. This will automatically switch to using your local file system for the cache.
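As a quick sanity check (assuming the stock sccache binary in the image), you can confirm the server is now using the local disk instead of S3:

sccache --show-stats   # the cache location it reports should be a local disk path, not an S3 bucket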

Hey @ZelboK, the devcontainer startup scripts should automatically detect that you don't have read/write access to the bucket and adjust the sccache configuration accordingly. We have tests for some of these scenarios, so it's difficult to say what's causing your problem specifically.

  1. Can you give us step-by-step instructions on how to reproduce the situation?
  2. Are you authenticating with GitHub (even though your user won't get S3 access)?
  3. Are the startup scripts completing successfully?
  4. What are the values of SCCACHE_BUCKET, SCCACHE_REGION, and SCCACHE_S3_NO_CREDENTIALS after the startup scripts have finished running? (For example, via the snippet below.)
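Something like this inside the container should capture that state (just a sketch; the ${VAR-<unset>} expansions simply print <unset> for variables that aren't defined):

env | grep -E '^SCCACHE_'   # dump all sccache-related environment variables
echo "${SCCACHE_BUCKET-<unset>} ${SCCACHE_REGION-<unset>} ${SCCACHE_S3_NO_CREDENTIALS-<unset>}"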

ZelboK commented

Hi,

I've never actually used devcontainers before contributing, so I apologize for any misunderstandings on my end. However, I do recall authenticating with GitHub at some point; I didn't think much of it because I just wanted to get started. From what you've described, it sounds like that could be the culprit.

I will get back to this issue with more details later today when I finish work. I figured the GitHub authentication part was worth mentioning right now, though.

@cwharris I have the same issue
Repro steps:

  1. git clone https://github.com/nvidia/cccl.git
  2. Open the folder in VS Code and press "Reopen in Container" right away
  3. devcontainer setup in the VS Code terminal - https://pastebin.com/mhLDhMDC
  4. echo "$SCCACHE_BUCKET $SCCACHE_REGION $SCCACHE_S3_NO_CREDENTIALS" outputs: rapids-sccache-devs us-east-2
  5. build - https://pastebin.com/DqiRYkEX
  6. after unsetting the env vars and restarting the sccache server as @jrhemstad suggested, compilation is successful

Thanks @bendyna! That's really helpful. We'll look into this ASAP.

We've pushed a release that contains a fix to at least one of the sccache env variable issues. Can you retry and let us know if the issue is resolved for you?

ZelboK commented

> We've pushed a release that contains a fix to at least one of the sccache env variable issues. Can you retry and let us know if the issue is resolved for you?

It works for me now! CUDA 12.3 with GCC 12 is what I tried.