nebuly-ai/nos

mig-agent pod failure

likku123 opened this issue · 19 comments

Hi,

I am seeing the below error in the nebuly-nos-nebuly-nos-mig-agent pod.

{"level":"info","ts":1678537855.2905262,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1678537855.2921324,"logger":"setup","msg":"Initializing NVML client"}
{"level":"info","ts":1678537855.2921576,"logger":"setup","msg":"Checking MIG-enabled GPUs"}
{"level":"info","ts":1678537855.450721,"logger":"setup","msg":"Cleaning up unused MIG resources"}
{"level":"error","ts":1678537855.5242505,"logger":"setup","msg":"unable to initialize agent","error":"[code: generic err: unable to get allocatable resources from Kubelet gRPC socket: rpc error: code = Unimplemented desc = unknown method GetAllocatableResources for service v1.PodResourcesLister]","stacktrace":"main.main\n\t/workspace/migagent.go:119\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

Please let me know what information is needed from my side; any pointers on why we get this error would also be greatly appreciated.

Thanks

Also, I am seeing the below error in the nvidia-device-plugin-daemonset pod:

2023/03/12 02:49:25 Starting FS watcher.
2023/03/12 02:49:25 Starting OS watcher.
2023/03/12 02:49:25 Starting Plugins.
2023/03/12 02:49:25 Loading configuration.
2023/03/12 02:49:25 Updating config with default resource matching patterns.
2023/03/12 02:49:25
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "mixed",
"failOnInitError": true,
"nvidiaDriverRoot": "/run/nvidia/driver",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "uuid"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
],
"mig": [
{
"pattern": "1g.10gb",
"name": "nvidia.com/mig-1g.10gb"
},
{
"pattern": "2g.20gb",
"name": "nvidia.com/mig-2g.20gb"
},
{
"pattern": "3g.40gb",
"name": "nvidia.com/mig-3g.40gb"
},
{
"pattern": "4g.40gb",
"name": "nvidia.com/mig-4g.40gb"
},
{
"pattern": "7g.80gb",
"name": "nvidia.com/mig-7g.80gb"
},
{
"pattern": "1g.10gb+me",
"name": "nvidia.com/mig-1g.10gb.me"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
2023/03/12 02:49:25 Retreiving plugins.
2023/03/12 02:49:25 Detected NVML platform: found NVML library
2023/03/12 02:49:25 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/03/12 02:49:25 Error: error starting plugins: error getting plugins: unable to load resource managers to manage plugin devices: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration

Any help or pointers are highly appreciated.

Thank you.

Hi @likku123, thank you for raising the issue! Could you please let me know which Kubernetes version you are using?

Regarding your first point, it seems like the MIG agent is not able to fetch the available resources from the Kubelet. It does that by using the GetAllocatableResources gRPC endpoint, which is available starting from k8s v1.23. Older k8s versions do not provide this endpoint, so the MIG agent won't be able to work.
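A quick way to verify whether your cluster meets this requirement is to check the kubelet version reported by each node, for example:

kubectl get nodes -o wide

The VERSION column shows the kubelet version, which should be v1.23 or newer on the GPU nodes.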

The error log you are getting from the nvidia-device-plugin DaemonSet occurs when there is a GPU with MIG mode enabled but without any MIG device. This is because the nvidia-device-plugin expects all MIG-enabled GPUs to have at least one MIG device, otherwise it considers the configuration invalid. This is the expected behaviour and it does not prevent nos from creating the MIG devices on that node: as soon as nos creates the first MIG device, the device plugin is restarted and the configuration is considered valid.
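Once that first MIG device has been created, you can double-check that the device plugin is advertising the MIG resources on the node with something like the following (just an illustrative command, replace the placeholder with your GPU node name):

kubectl describe node <gpu-node-name> | grep nvidia.com/mig-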

We should update the documentation with this information. Thanks for bringing this to my attention!

Hi @Telemaco019,

The Kubernetes version I am using right now is v1.20.8.

Is there any other workaround to get the MIG agent up and running on k8s v1.20.8?

Also, when I try to spin up a pod with the below requests in its YAML:

limits:
  nvidia.com/mig-1g.10gb: 1

I see the below error in the nebuly-nos-nebuly-nos-gpu-partitioner pod.

"error":"model "NVIDIA-A100-SXM4-80GB" is not associated with any known GPU","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:234"}

Are these errors occurring because of MIG-agent failure?

Thanks

The MIG Agent requires that endpoint to retrieve the MIG devices exposed on each node, each with its device ID and status (e.g. used or free), so that nos can use this information to choose the right partitioning state. Unfortunately, I don't see any other way of getting that information right now, so I can only suggest upgrading k8s to a supported version, i.e. 1.23 or later.

Regarding the error you are getting from the GPU Partitioner: it is not related to the MIG agent failure. The issue is that nos does not know the set of available MIG geometries for that specific GPU model, as NVIDIA-A100-SXM4-80GB is not included in the default value of knownMigGeometries in the Helm chart.
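As a side note, the model name in the error is the same string that GPU Feature Discovery exposes in the nvidia.com/gpu.product node label, so a quick way to see which model names your nodes advertise (and therefore which names must appear in knownMigGeometries) is:

kubectl get nodes -L nvidia.com/gpu.product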

I'll include that model too in the next release, thank you for spotting the issue! In the meantime, you can quickly fix it by customizing the available MIG geometries configuration as described here. Specifically, you can do that by providing the following values.yaml when installing the Helm chart:

knownMigGeometries:
  - models: [ "A30" ]
    allowedGeometries:
      - 1g.6gb: 4
      - 1g.6gb: 2
        2g.12gb: 1
      - 2g.12gb: 2
      - 4g.24gb: 1
  - models: [ "A100-SXM4-40GB", "NVIDIA-A100-40GB-PCIe" ]
    allowedGeometries:
      - 1g.5gb: 7
      - 1g.5gb: 5
        2g.10gb: 1
      - 1g.5gb: 3
        2g.10gb: 2
      - 1g.5gb: 1
        2g.10gb: 3
      - 1g.5gb: 2
        2g.10gb: 1
        3g.20gb: 1
      - 2g.10gb: 2
        3g.20gb: 1
      - 1g.5gb: 3
        3g.20gb: 1
      - 1g.5gb: 1
        2g.10gb: 1
        3g.20gb: 1
      - 3g.20gb: 2
      - 1g.5gb: 3
        4g.20gb: 1
      - 1g.5gb: 1
        2g.10gb: 1
        4g.20gb: 1
      - 7g.40gb: 1
  - models: [ "NVIDIA-A100-SXM4-80GB", "NVIDIA-A100-80GB-PCIe" ]
    allowedGeometries:
      - 1g.10gb: 7
      - 1g.10gb: 5
        2g.20gb: 1
      - 1g.10gb: 3
        2g.20gb: 2
      - 1g.10gb: 1
        2g.20gb: 3
      - 1g.10gb: 2
        2g.20gb: 1
        3g.40gb: 1
      - 2g.20gb: 2
        3g.20gb: 1
      - 1g.10gb: 3
        3g.40gb: 1
      - 1g.10gb: 1
        2g.20gb: 1
        3g.40gb: 1
      - 3g.40gb: 2
      - 1g.10gb: 3
        4g.40gb: 1
      - 1g.10gb: 1
        2g.20gb: 1
        4g.40gb: 1
      - 7g.79gb: 1

Hope this helps!

@Telemaco019

Thanks a lot for the detailed explanation. I think the only option left now is to upgrade k8s to 1.23. I'll work on that and will let you know if I find anything new.

Thanks.

@Telemaco019

I have updated to k8s 1.23 and now I don't see any errors.

Basically I am deploying nebuly-nos via FluxCD. One thing I have observed is that whatever I update in the values.yaml file is not getting applied.

For example, I have implemented your suggestion of adding the model NVIDIA-A100-SXM4-80GB to values.yaml, but when I deploy it the changes are not actually reflected.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nebuly-nos
  namespace: flux-system
spec:
  interval: 5m
  targetNamespace: nebuly-nos
  chart:
    spec:
      chart: nos
      version: '0.1.0'
      sourceRef:
        kind: HelmRepository
        name: nebuly-nos-charts
        namespace: flux-system
      interval: 5m
      reconcileStrategy: Revision
  install:
    remediation:
      retries: 4
  upgrade:
    remediation:
      remediateLastFailure: True
  values:
    knownMigGeometries:
      - models: [ "A30" ]
        allowedGeometries:
          - 1g.6gb: 4
          - 1g.6gb: 2
            2g.12gb: 1
          - 2g.12gb: 2
          - 4g.24gb: 1
      - models: [ "A100-SXM4-40GB", "NVIDIA-A100-40GB-PCIe" ]
        allowedGeometries:
          - 1g.5gb: 7
          - 1g.5gb: 5
            2g.10gb: 1
          - 1g.5gb: 3
            2g.10gb: 2
          - 1g.5gb: 1
            2g.10gb: 3
          - 1g.5gb: 2
            2g.10gb: 1
            3g.20gb: 1
          - 2g.10gb: 2
            3g.20gb: 1
          - 1g.5gb: 3
            3g.20gb: 1
          - 1g.5gb: 1
            2g.10gb: 1
            3g.20gb: 1
          - 3g.20gb: 2
          - 1g.5gb: 3
            4g.20gb: 1
          - 1g.5gb: 1
            2g.10gb: 1
            4g.20gb: 1
          - 7g.40gb: 1
      - models: [ "NVIDIA-A100-SXM4-80GB", "NVIDIA-A100-80GB-PCIe" ]
        allowedGeometries:
          - 1g.10gb: 7
          - 1g.10gb: 5
            2g.20gb: 1
          - 1g.10gb: 3
            2g.20gb: 2
          - 1g.10gb: 1
            2g.20gb: 3
          - 1g.10gb: 2
            2g.20gb: 1
            3g.40gb: 1
          - 2g.20gb: 2
            3g.20gb: 1
          - 1g.10gb: 3
            3g.40gb: 1
          - 1g.10gb: 1
            2g.20gb: 1
            3g.40gb: 1
          - 3g.40gb: 2
          - 1g.10gb: 3
            4g.40gb: 1
          - 1g.10gb: 1
            2g.20gb: 1
            4g.40gb: 1
          - 7g.79gb: 1
kubectl describe configmap nebuly-nos-nebuly-nos-gpu-partitioner-known-mig-geometries
Name:         nebuly-nos-nebuly-nos-gpu-partitioner-known-mig-geometries
Namespace:    nebuly-nos
Labels:       app.kubernetes.io/component=gpu-partitioner
              app.kubernetes.io/instance=nebuly-nos-nebuly-nos
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nos-gpu-partitioner
              app.kubernetes.io/part-of=nos
              app.kubernetes.io/version=0.1.0
              helm.sh/chart=nos-0.1.0
              helm.toolkit.fluxcd.io/name=nebuly-nos
              helm.toolkit.fluxcd.io/namespace=flux-system
Annotations:  meta.helm.sh/release-name: nebuly-nos-nebuly-nos
              meta.helm.sh/release-namespace: nebuly-nos

Data
====
known_mig_geometries.yaml:
----
- allowedGeometries:
  - 1g.6gb: 4
  - 1g.6gb: 2
    2g.12gb: 1
  - 2g.12gb: 2
  - 4g.24gb: 1
  models:
  - A30
- allowedGeometries:
  - 1g.5gb: 7
  - 1g.5gb: 5
    2g.10gb: 1
  - 1g.5gb: 3
    2g.10gb: 2
  - 1g.5gb: 1
    2g.10gb: 3
  - 1g.5gb: 2
    2g.10gb: 1
    3g.20gb: 1
  - 2g.10gb: 2
    3g.20gb: 1
  - 1g.5gb: 3
    3g.20gb: 1
  - 1g.5gb: 1
    2g.10gb: 1
    3g.20gb: 1
  - 3g.20gb: 2
  - 1g.5gb: 3
    4g.20gb: 1
  - 1g.5gb: 1
    2g.10gb: 1
    4g.20gb: 1
  - 7g.40gb: 1
  models:
  - A100-SXM4-40GB
  - NVIDIA-A100-40GB-PCIe
- allowedGeometries:
  - 1g.10gb: 7
  - 1g.10gb: 5
    2g.20gb: 1
  - 1g.10gb: 3
    2g.20gb: 2
  - 1g.10gb: 1
    2g.20gb: 3
  - 1g.10gb: 2
    2g.20gb: 1
    3g.40gb: 1
  - 2g.20gb: 2
    3g.20gb: 1
  - 1g.10gb: 3
    3g.40gb: 1
  - 1g.10gb: 1
    2g.20gb: 1
    3g.40gb: 1
  - 3g.40gb: 2
  - 1g.10gb: 3
    4g.40gb: 1
  - 1g.10gb: 1
    2g.20gb: 1
    4g.40gb: 1
  - 7g.79gb: 1
  models:
  - NVIDIA-A100-80GB-PCIe

Am I missing anything here? The only difference I see is that I have previously implemented Helm chart deployments via FluxCD using URLs, and this is my first time using OCI.

Thank you for all the help till now.

Hi @likku123, thank you for the follow-up! I'm glad to hear it works now with k8s 1.23 :)

There is a small mistake in the values you are providing to the Helm chart: the knownMigGeometries field should be nested under the gpuPartitioner field, not at the root level. The correct HelmRelease resource should look like this:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: nebuly-nos
  namespace: flux-system
spec:
  interval: 5m
  targetNamespace: nebuly-nos
  chart:
    spec:
      chart: nos
      version: '0.1.0'
      sourceRef:
        kind: HelmRepository
        name: nebuly-nos-charts
        namespace: flux-system
      interval: 5m
      reconcileStrategy: Revision
  install:
    remediation:
      retries: 4
  upgrade:
    remediation:
      remediateLastFailure: True
  values:
    gpuPartitioner:
      knownMigGeometries:
        - models: [ "A30" ]
          allowedGeometries:
            - 1g.6gb: 4
            - 1g.6gb: 2
              2g.12gb: 1
            - 2g.12gb: 2
            - 4g.24gb: 1
        - models: [ "A100-SXM4-40GB", "NVIDIA-A100-40GB-PCIe" ]
          allowedGeometries:
            - 1g.5gb: 7
            - 1g.5gb: 5
              2g.10gb: 1
            - 1g.5gb: 3
              2g.10gb: 2
            - 1g.5gb: 1
              2g.10gb: 3
            - 1g.5gb: 2
              2g.10gb: 1
              3g.20gb: 1
            - 2g.10gb: 2
              3g.20gb: 1
            - 1g.5gb: 3
              3g.20gb: 1
            - 1g.5gb: 1
              2g.10gb: 1
              3g.20gb: 1
            - 3g.20gb: 2
            - 1g.5gb: 3
              4g.20gb: 1
            - 1g.5gb: 1
              2g.10gb: 1
              4g.20gb: 1
            - 7g.40gb: 1
        - models: [ "NVIDIA-A100-SXM4-80GB", "NVIDIA-A100-80GB-PCIe" ]
          allowedGeometries:
            - 1g.10gb: 7
            - 1g.10gb: 5
              2g.20gb: 1
            - 1g.10gb: 3
              2g.20gb: 2
            - 1g.10gb: 1
              2g.20gb: 3
            - 1g.10gb: 2
              2g.20gb: 1
              3g.40gb: 1
            - 2g.20gb: 2
              3g.20gb: 1
            - 1g.10gb: 3
              3g.40gb: 1
            - 1g.10gb: 1
              2g.20gb: 1
              3g.40gb: 1
            - 3g.40gb: 2
            - 1g.10gb: 3
              4g.40gb: 1
            - 1g.10gb: 1
              2g.20gb: 1
              4g.40gb: 1
            - 7g.79gb: 1

Hope this helps!
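As a side note, a quick way to double-check which values Flux actually passed to the Helm release is to query the release directly, using the release name and namespace shown in the ConfigMap annotations above:

helm get values nebuly-nos-nebuly-nos -n nebuly-nos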

(btw, I've updated the default values of the chart to also include the model NVIDIA-A100-SXM4-80GB in the allowed geometries, so with the next minor release 0.1.1 it won't be necessary to customize this value anymore)

@Telemaco019 Thanks a lot for your help, and I appreciate the guidance you have provided. I'll close this issue and will raise a new one if I face any more issues.

Hi @Telemaco019
Sorry to bother you again, but right now I am stuck at a point where I have no clue how to move forward.

I still see the below error after installing nos, and the pod is also going into CrashLoopBackOff.

kubectl logs nvidia-device-plugin-daemonset-md7t8
2023/03/28 15:49:23 Starting FS watcher.
2023/03/28 15:49:24 Starting OS watcher.
2023/03/28 15:49:24 Starting Plugins.
2023/03/28 15:49:24 Loading configuration.
2023/03/28 15:49:24 Updating config with default resource matching patterns.
2023/03/28 15:49:24 Running with config: { "version": "v1", "flags": { "migStrategy": "mixed", "failOnInitError": true, "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "plugin": { "passDeviceSpecs": true, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ], "mig": [ { "pattern": "1g.10gb", "name": "nvidia.com/mig-1g.10gb" }, { "pattern": "2g.20gb", "name": "nvidia.com/mig-2g.20gb" }, { "pattern": "3g.40gb", "name": "nvidia.com/mig-3g.40gb" }, { "pattern": "4g.40gb", "name": "nvidia.com/mig-4g.40gb" }, { "pattern": "7g.80gb", "name": "nvidia.com/mig-7g.80gb" }, { "pattern": "1g.10gb+me", "name": "nvidia.com/mig-1g.10gb.me" } ] }, "sharing": { "timeSlicing": {} } }
2023/03/28 15:49:24 Retreiving plugins.
2023/03/28 15:49:24 Detected NVML platform: found NVML library
2023/03/28 15:49:24 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/03/28 15:49:24 Error: error starting plugins: error getting plugins: unable to load resource managers to manage plugin devices: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration

kubectl describe pod nvidia-device-plugin-daemonset-md7t8
Name: nvidia-device-plugin-daemonset-md7t8
Namespace: gpu-operator-resources
Priority: 2000001000
Priority Class Name: system-node-critical
Node: abc.xyz.com/10.0.0.24
Start Time: Tue, 28 Mar 2023 15:28:02 +0000
Labels: app=nvidia-device-plugin-daemonset
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=5b7fb6c7d4
helm.sh/chart=gpu-operator-v22.9.2
pod-template-generation=1
Annotations: kubernetes.io/psp: psp-no-restriction
Status: Running
IP: 100.96.170.250
IPs:
IP: 100.96.170.250
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
toolkit-validation:
Container ID: containerd://427d3223b20d2c0b3d409f0b918249dfdf1a26545d768cc76c82623f1c5455e2
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:00f1476548fbed9ee01961443a73bf65396c2e8bb2b84426f949dd56cb4d14cd
Port:
Host Port:
Command:
sh
-c
Args:
until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 28 Mar 2023 15:28:04 +0000
Finished: Tue, 28 Mar 2023 15:28:04 +0000
Ready: True
Restart Count: 0
Environment:
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gqxzw (ro)
Containers:
nvidia-device-plugin:
Container ID: containerd://3c0850a8fb9cdcd8700b30a8a709c953ae9a0cab573ab9e14fa63e07384bcc83
Image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0-ubi8
Image ID: nvcr.io/nvidia/k8s-device-plugin@sha256:9c17d3a907eb77eb8f7b4f3faf52d8352e4252af92003f828083f80d629bd2c3
Port:
Host Port:
Command:
bash
-c
Args:
[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 28 Mar 2023 15:54:34 +0000
Finished: Tue, 28 Mar 2023 15:54:34 +0000
Ready: False
Restart Count: 10
Environment:
PASS_DEVICE_SPECS: true
FAIL_ON_INIT_ERROR: true
DEVICE_LIST_STRATEGY: envvar
DEVICE_ID_STRATEGY: uuid
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
MIG_STRATEGY: mixed
NVIDIA_MIG_MONITOR_DEVICES: all
Mounts:
/run/nvidia from run-nvidia (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gqxzw (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: Directory
kube-api-access-gqxzw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.device-plugin=true
Tolerations: gpugate:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
ondemand:NoSchedule op=Exists
pre-release:NoSchedule op=Exists
reserved:NoSchedule op=Exists
Events:
Type Reason Age From Message


Normal Scheduled 31m default-scheduler Successfully assigned gpu-operator-resources/nvidia-device-plugin-daemonset-md7t8 to abc.xyz.com/10.0.0.24
Normal Pulled 31m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.2" already present on machine
Normal Created 31m kubelet Created container toolkit-validation
Normal Started 31m kubelet Started container toolkit-validation
Normal Started 30m (x4 over 31m) kubelet Started container nvidia-device-plugin
Normal Pulled 29m (x5 over 31m) kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.13.0-ubi8" already present on machine
Normal Created 29m (x5 over 31m) kubelet Created container nvidia-device-plugin
Warning BackOff 90s (x139 over 31m) kubelet Back-off restarting failed container

Am I missing anything here?

Is this similar to the bug you have raised, #25?

Hello @likku123, no worries at all! And thank you for the detailed information!

Yes, it seems the problem is related to #25. However, if your node has just a single GPU, then these error logs should not prevent GPU partitioning from working properly. Even though initially the nvidia-device-plugin Pod crashes, when nos creates the requested MIG resource the device-plugin is restarted and it should advertise the new MIG resources correctly.

You can try to submit a Pod requesting some MIG resources: after at most ~30 seconds the requested resources should be created and the device-plugin Pod should run properly.
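For example, a minimal test Pod like the following should be enough to trigger the partitioning (just a sketch, the Pod name and image are arbitrary):

apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
    - name: sleepy
      image: busybox:latest
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1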

However, we have already fixed #25 on the main branch, so installing the latest version of the GPU Partitioner container should get rid of the error logs you are seeing, as GPUs are now initialized with the largest available MIG devices. You can install the latest version by adding the following entry to your HelmRelease values:

gpuPartitioner:
  image:
    tag: latest
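After upgrading, you can check which image the GPU Partitioner Pod is actually running with something like the following (the label selector is an assumption based on the chart's naming, adjust it if yours differs):

kubectl get pods -n nebuly-nos -l app.kubernetes.io/name=nos-gpu-partitioner -o jsonpath='{.items[*].status.containerStatuses[*].imageID}'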

Let me know if you need any help!

Hi @Telemaco019,

Thank you for the suggestion.

I have updated the Helm values with the latest gpuPartitioner image, but I still see the below error in the device-plugin DaemonSet:

2023/03/29 06:29:48 Starting FS watcher.
2023/03/29 06:29:48 Starting OS watcher.
2023/03/29 06:29:48 Starting Plugins.
2023/03/29 06:29:48 Loading configuration.
2023/03/29 06:29:48 Updating config with default resource matching patterns.
2023/03/29 06:29:48 Running with config: { "version": "v1", "flags": { "migStrategy": "mixed", "failOnInitError": true, "nvidiaDriverRoot": "/run/nvidia/driver", "gdsEnabled": false, "mofedEnabled": false, "plugin": { "passDeviceSpecs": true, "deviceListStrategy": "envvar", "deviceIDStrategy": "uuid" } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ], "mig": [ { "pattern": "1g.10gb", "name": "nvidia.com/mig-1g.10gb" }, { "pattern": "2g.20gb", "name": "nvidia.com/mig-2g.20gb" }, { "pattern": "3g.40gb", "name": "nvidia.com/mig-3g.40gb" }, { "pattern": "4g.40gb", "name": "nvidia.com/mig-4g.40gb" }, { "pattern": "7g.80gb", "name": "nvidia.com/mig-7g.80gb" }, { "pattern": "1g.10gb+me", "name": "nvidia.com/mig-1g.10gb.me" } ] }, "sharing": { "timeSlicing": {} } }
2023/03/29 06:29:48 Retreiving plugins.
2023/03/29 06:29:48 Detected NVML platform: found NVML library
2023/03/29 06:29:48 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/03/29 06:29:48 Error: error starting plugins: error getting plugins: unable to load resource managers to manage plugin devices: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
Below is the deployment I use to spin up a pod with a MIG requirement:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-1-likku
spec:
  replicas: 2
  selector:
    matchLabels:
      app: dummy
  template:
    metadata:
      labels:
        app: dummy
    spec:
      containers:
        - name: sleepy
          image: busybox:latest
          command: ["sleep", "120"]
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
      nodeSelector:
        gputype: A100

I see the below message when I spin up the pod; the device-plugin DaemonSet also gets restarted but throws the same error.

Warning FailedScheduling 21s (x14 over 14m) default-scheduler 0/18 nodes are available: 1 Insufficient nvidia.com/mig-1g.10gb, 2 node(s) had taint {ephemeral: true}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 node(s) didn't match Pod's node affinity/selector, 7 node(s) had taint {nvidia.com/gpu: true}, that the pod didn't tolerate

I see the below messages from the gpu-partitioner pod:

{"level":"info","ts":1680070903.601499,"msg":"3 out of 3 pending pods could be helped","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-sdtxc","reconcileID":"7f0d9b0a-c6d0-4b3c-88f0-7bddaca49208"}
{"level":"info","ts":1680070903.6024787,"msg":"computed desired partitioning state","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-leela-7df94448b4-sdtxc","reconcileID":"7f0d9b0a-c6d0-4b3c-88f0-7bddaca49208","partitioning":{"DesiredState":{"lxjh820.phibred.com":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":7}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}}}}

The GPU Operator Helm chart I am using is https://github.com/NVIDIA/gpu-operator/blob/v22.9.2/deployments/gpu-operator/values.yaml

Hi @likku123, thank you for providing such detailed information! It seems like the GPUs on your node are still not being initialized correctly, leading to the NVIDIA Device Plugin crashing because MIG mode is enabled with no MIG devices.

I see from the GPU Partitioner logs that the GPUs with indexes from 1 to 7 are not initialized correctly. I've tried to replicate the issue on a Node with only 2 GPUs but I wasn't able to get the same result, as the GPU Partitioner correctly initialized all the GPUs with a single 7g.40gb MIG device.

Could you please provide the hash of the GPU Partitioner Docker image you are running?
Also, to check whether the GPU Partitioner is performing GPU initialization, could you please try restarting the GPU Partitioner pods and checking the initial logs? You should see these two log messages:

- initializing MIG geometry
- node is not initialized yet, skipping
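A simple way to restart the GPU Partitioner is to roll its Deployment, assuming it is named after the pods you mentioned earlier:

kubectl rollout restart deployment nebuly-nos-nebuly-nos-gpu-partitioner -n nebuly-nos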

Hi @Telemaco019

Below are the details you have requested.

Image ID: ghcr.io/nebuly-ai/nos-gpu-partitioner@sha256:31a8754a14dd709ff3aa81fd9cd0ef8378438a4c068a73d8449bb0c77bc0c65f

Below are the logs after restarting the gpu-partitioner pod.

kubectl logs nebuly-nos-nebuly-nos-gpu-partitioner-86d78f585d-cz674 {"level":"info","ts":1680194589.4523365,"logger":"setup","msg":"using known MIG geometries loaded from file","geometries":[{"models":["A30"],"allowedGeometries":[{"1g.6gb":4},{"1g.6gb":2,"2g.12gb":1},{"2g.12gb":2},{"4g.24gb":1}]},{"models":["A100-SXM4-40GB","NVIDIA-A100-40GB-PCIe"],"allowedGeometries":[{"1g.5gb":7},{"1g.5gb":5,"2g.10gb":1},{"1g.5gb":3,"2g.10gb":2},{"1g.5gb":1,"2g.10gb":3},{"1g.5gb":2,"2g.10gb":1,"3g.20gb":1},{"2g.10gb":2,"3g.20gb":1},{"1g.5gb":3,"3g.20gb":1},{"1g.5gb":1,"2g.10gb":1,"3g.20gb":1},{"3g.20gb":2},{"1g.5gb":3,"4g.20gb":1},{"1g.5gb":1,"2g.10gb":1,"4g.20gb":1},{"7g.40gb":1}]},{"models":["NVIDIA-A100-SXM4-80GB","NVIDIA-A100-80GB-PCIe"],"allowedGeometries":[{"1g.10gb":7},{"1g.10gb":5,"2g.20gb":1},{"1g.10gb":3,"2g.20gb":2},{"1g.10gb":1,"2g.20gb":3},{"1g.10gb":2,"2g.20gb":1,"3g.40gb":1},{"2g.20gb":2,"3g.20gb":1},{"1g.10gb":3,"3g.40gb":1},{"1g.10gb":1,"2g.20gb":1,"3g.40gb":1},{"3g.40gb":2},{"1g.10gb":3,"4g.40gb":1},{"1g.10gb":1,"2g.20gb":1,"4g.40gb":1},{"7g.79gb":1}]}]} {"level":"info","ts":1680194590.438198,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"} {"level":"info","ts":1680194590.4400725,"logger":"setup","msg":"scheduler configured with default profile"} {"level":"info","ts":1680194590.4408388,"logger":"setup","msg":"pods batch window","timeout":"1m0s","idle":"10s"} {"level":"info","ts":1680194590.4409235,"logger":"setup","msg":"starting manager"} {"level":"info","ts":1680194590.441122,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"127.0.0.1:8080"} {"level":"info","ts":1680194590.4411697,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"} I0330 16:43:10.642240 1 leaderelection.go:248] attempting to acquire leader lease nebuly-nos/gpu-partitioner.nebuly.com... 
I0330 16:43:36.071479 1 leaderelection.go:258] successfully acquired lease nebuly-nos/gpu-partitioner.nebuly.com {"level":"info","ts":1680194616.0717614,"msg":"Starting EventSource","controller":"clusterstate-node-controller","controllerGroup":"","controllerKind":"Node","source":"kind source: *v1.Node"} {"level":"info","ts":1680194616.0718849,"msg":"Starting Controller","controller":"clusterstate-node-controller","controllerGroup":"","controllerKind":"Node"} {"level":"info","ts":1680194616.0719476,"msg":"Starting EventSource","controller":"mps-partitioner-controller","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"} {"level":"info","ts":1680194616.0719204,"msg":"Starting EventSource","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"} {"level":"info","ts":1680194616.072034,"msg":"Starting Controller","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod"} {"level":"info","ts":1680194616.0720127,"msg":"Starting Controller","controller":"mps-partitioner-controller","controllerGroup":"","controllerKind":"Pod"} {"level":"info","ts":1680194616.0721114,"msg":"Starting EventSource","controller":"clusterstate-pod-controller","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"} {"level":"info","ts":1680194616.0721612,"msg":"Starting Controller","controller":"clusterstate-pod-controller","controllerGroup":"","controllerKind":"Pod"} {"level":"info","ts":1680194616.172476,"msg":"Starting workers","controller":"clusterstate-node-controller","controllerGroup":"","controllerKind":"Node","worker count":10} {"level":"info","ts":1680194616.175679,"msg":"Starting workers","controller":"mps-partitioner-controller","controllerGroup":"","controllerKind":"Pod","worker count":1} {"level":"info","ts":1680194616.1779187,"msg":"Starting workers","controller":"clusterstate-pod-controller","controllerGroup":"","controllerKind":"Pod","worker count":10} {"level":"info","ts":1680194616.1779795,"msg":"Starting workers","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","worker count":1} {"level":"info","ts":1680194626.3497434,"msg":"processing pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-sdtxc","reconcileID":"f48cfe86-2734-45ba-90d4-537761866602"} {"level":"info","ts":1680194626.3504162,"msg":"found 3 pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-sdtxc","reconcileID":"f48cfe86-2734-45ba-90d4-537761866602"} {"level":"info","ts":1680194626.350438,"msg":"3 out of 3 pending pods could be helped","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-sdtxc","reconcileID":"f48cfe86-2734-45ba-90d4-537761866602"} {"level":"info","ts":1680194626.3515482,"msg":"computed desired partitioning 
state","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-sdtxc","reconcileID":"f48cfe86-2734-45ba-90d4-537761866602","partitioning":{"DesiredState":{"lxjh820.phibred.com":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":2,"nvidia.com/mig-2g.20gb":1,"nvidia.com/mig-3g.40gb":1}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}}}} {"level":"info","ts":1680194626.3516874,"msg":"applying desired partitioning","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-sdtxc","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-sdtxc","reconcileID":"f48cfe86-2734-45ba-90d4-537761866602"}`

I am also sharing the node labels and annotations for your reference. Please let me know if you find any abnormality in them.

Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux feature.node.kubernetes.io/cpu-cpuid.ADX=true feature.node.kubernetes.io/cpu-cpuid.AESNI=true feature.node.kubernetes.io/cpu-cpuid.AVX=true feature.node.kubernetes.io/cpu-cpuid.AVX2=true feature.node.kubernetes.io/cpu-cpuid.CLZERO=true feature.node.kubernetes.io/cpu-cpuid.CPBOOST=true feature.node.kubernetes.io/cpu-cpuid.FMA3=true feature.node.kubernetes.io/cpu-cpuid.IBS=true feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD=true feature.node.kubernetes.io/cpu-cpuid.INVLPGB=true feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW=true feature.node.kubernetes.io/cpu-cpuid.MCOMMIT=true feature.node.kubernetes.io/cpu-cpuid.MSRIRC=true feature.node.kubernetes.io/cpu-cpuid.RDPRU=true feature.node.kubernetes.io/cpu-cpuid.SHA=true feature.node.kubernetes.io/cpu-cpuid.SSE4A=true feature.node.kubernetes.io/cpu-cpuid.SUCCOR=true feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true feature.node.kubernetes.io/cpu-hardware_multithreading=true feature.node.kubernetes.io/cpu-rdt.RDTCMT=true feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true feature.node.kubernetes.io/cpu-rdt.RDTMBM=true feature.node.kubernetes.io/cpu-rdt.RDTMON=true feature.node.kubernetes.io/custom-rdma.available=true feature.node.kubernetes.io/custom-rdma.capable=true feature.node.kubernetes.io/kernel-config.NO_HZ=true feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true feature.node.kubernetes.io/kernel-version.full=5.4.0-144-generic feature.node.kubernetes.io/kernel-version.major=5 feature.node.kubernetes.io/kernel-version.minor=4 feature.node.kubernetes.io/kernel-version.revision=0 feature.node.kubernetes.io/memory-numa=true feature.node.kubernetes.io/pci-10de.present=true feature.node.kubernetes.io/pci-10de.sriov.capable=true feature.node.kubernetes.io/pci-15b3.present=true feature.node.kubernetes.io/pci-1a03.present=true feature.node.kubernetes.io/storage-nonrotationaldisk=true feature.node.kubernetes.io/system-os_release.ID=ubuntu feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04 feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20 feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04 feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true gpumem=80 gputype=A100 kubernetes.io/arch=amd64 kubernetes.io/hostname=abc.xyz.com kubernetes.io/os=linux nos.nebuly.com/gpu-partitioning=mig nvidia.com/cuda.driver.major=525 nvidia.com/cuda.driver.minor=60 nvidia.com/cuda.driver.rev=13 nvidia.com/cuda.runtime.major=12 nvidia.com/cuda.runtime.minor=0 nvidia.com/gfd.timestamp=1680069683 nvidia.com/gpu-driver-upgrade-state=validation-required nvidia.com/gpu.compute.major=8 nvidia.com/gpu.compute.minor=0 nvidia.com/gpu.count=8 nvidia.com/gpu.deploy.container-toolkit=true nvidia.com/gpu.deploy.dcgm=true nvidia.com/gpu.deploy.dcgm-exporter=true nvidia.com/gpu.deploy.device-plugin=true nvidia.com/gpu.deploy.driver=true nvidia.com/gpu.deploy.gpu-feature-discovery=true nvidia.com/gpu.deploy.mig-manager=false nvidia.com/gpu.deploy.node-status-exporter=true nvidia.com/gpu.deploy.nvsm= nvidia.com/gpu.deploy.operator-validator=true 
nvidia.com/gpu.family=ampere nvidia.com/gpu.machine=AS--4124GO-NART nvidia.com/gpu.memory=81920 nvidia.com/gpu.present=true nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB nvidia.com/gpu.replicas=0 nvidia.com/mig-1g.10gb.count=5 nvidia.com/mig-1g.10gb.engines.copy=1 nvidia.com/mig-1g.10gb.engines.decoder=0 nvidia.com/mig-1g.10gb.engines.encoder=0 nvidia.com/mig-1g.10gb.engines.jpeg=0 nvidia.com/mig-1g.10gb.engines.ofa=0 nvidia.com/mig-1g.10gb.memory=9728 nvidia.com/mig-1g.10gb.multiprocessors=14 nvidia.com/mig-1g.10gb.product=NVIDIA-A100-SXM4-80GB-MIG-1g.10gb nvidia.com/mig-1g.10gb.replicas=1 nvidia.com/mig-1g.10gb.slices.ci=1 nvidia.com/mig-1g.10gb.slices.gi=1 nvidia.com/mig-2g.20gb.count=1 nvidia.com/mig-2g.20gb.engines.copy=2 nvidia.com/mig-2g.20gb.engines.decoder=1 nvidia.com/mig-2g.20gb.engines.encoder=0 nvidia.com/mig-2g.20gb.engines.jpeg=0 nvidia.com/mig-2g.20gb.engines.ofa=0 nvidia.com/mig-2g.20gb.memory=19968 nvidia.com/mig-2g.20gb.multiprocessors=28 nvidia.com/mig-2g.20gb.product=NVIDIA-A100-SXM4-80GB-MIG-2g.20gb nvidia.com/mig-2g.20gb.replicas=1 nvidia.com/mig-2g.20gb.slices.ci=2 nvidia.com/mig-2g.20gb.slices.gi=2 nvidia.com/mig.capable=true nvidia.com/mig.strategy=mixed onprem=true Annotations: io.cilium.network.ipv4-cilium-host: 100.96.170.244 io.cilium.network.ipv4-health-ip: 100.96.170.43 io.cilium.network.ipv4-pod-cidr: 100.96.170.0/24 kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock nfd.node.kubernetes.io/extended-resources: nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CLZERO,cpu-cpuid.CPBOOST,cpu-cpuid.FMA3,cpu-cpuid.IBS,cpu-cpuid.IBSBR... nfd.node.kubernetes.io/master.version: v0.10.1 nfd.node.kubernetes.io/worker.version: v0.10.1 node.alpha.kubernetes.io/ttl: 0 nos.nebuly.com/spec-gpu-0-1g.10gb: 2 nos.nebuly.com/spec-gpu-0-2g.20gb: 1 nos.nebuly.com/spec-gpu-0-3g.40gb: 1 nos.nebuly.com/spec-partitioning-plan: 1680194681 nos.nebuly.com/status-partitioning-plan: 1680194681 nvidia.com/gpu-driver-upgrade-enabled: true volumes.kubernetes.io/controller-managed-attach-detach: true

In the meantime, I'll try to revert all the changes in the GPU Operator and see.

Thank you for the support so far!

Thank you so much for the information you shared, it's been very helpful.

Your node labels and annotations are correct; the problem is a silly bug in the GPU Partitioner that prevents it from initializing the nodes when at least one GPU has already been partitioned. Fixing it should be quick, as it's just a wrong if condition. I'll push the "latest" version with the fix ASAP.

Thanks again for helping troubleshoot the issue and spotting this bug! I'll update you as soon as the fix is available.

Thank you, @Telemaco019!
I'll just wait for your confirmation before I proceed with any changes in the GPU Operator.

Hi @likku123, the fix is now available on the "latest" version of the GPU Partitioner image. Pulling the new image should finally solve your problem. Thanks again for your help in spotting this bug!

Let me know if you need any help, and thank you for your patience :)

Hi @Telemaco019

Still no success for me :(
ghcr.io/nebuly-ai/nos-gpu-partitioner@sha256:31a8754a14dd709ff3aa81fd9cd0ef8378438a4c068a73d8449bb0c77bc0c65f

I am still getting the same errors. Let me know if any more info is needed from my side.

kubectl logs nebuly-nos-nebuly-nos-gpu-partitioner-7c6bd668b5-jcc5p {"level":"info","ts":1680276067.5328996,"logger":"setup","msg":"using known MIG geometries loaded from file","geometries":[{"models":["A30"],"allowedGeometries":[{"1g.6gb":4},{"1g.6gb":2,"2g.12gb":1},{"2g.12gb":2},{"4g.24gb":1}]},{"models":["A100-SXM4-40GB","NVIDIA-A100-40GB-PCIe"],"allowedGeometries":[{"1g.5gb":7},{"1g.5gb":5,"2g.10gb":1},{"1g.5gb":3,"2g.10gb":2},{"1g.5gb":1,"2g.10gb":3},{"1g.5gb":2,"2g.10gb":1,"3g.20gb":1},{"2g.10gb":2,"3g.20gb":1},{"1g.5gb":3,"3g.20gb":1},{"1g.5gb":1,"2g.10gb":1,"3g.20gb":1},{"3g.20gb":2},{"1g.5gb":3,"4g.20gb":1},{"1g.5gb":1,"2g.10gb":1,"4g.20gb":1},{"7g.40gb":1}]},{"models":["NVIDIA-A100-SXM4-80GB","NVIDIA-A100-80GB-PCIe"],"allowedGeometries":[{"1g.10gb":7},{"1g.10gb":5,"2g.20gb":1},{"1g.10gb":3,"2g.20gb":2},{"1g.10gb":1,"2g.20gb":3},{"1g.10gb":2,"2g.20gb":1,"3g.40gb":1},{"2g.20gb":2,"3g.20gb":1},{"1g.10gb":3,"3g.40gb":1},{"1g.10gb":1,"2g.20gb":1,"3g.40gb":1},{"3g.40gb":2},{"1g.10gb":3,"4g.40gb":1},{"1g.10gb":1,"2g.20gb":1,"4g.40gb":1},{"7g.79gb":1}]}]} {"level":"info","ts":1680276068.5236146,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"} {"level":"info","ts":1680276068.5287971,"logger":"setup","msg":"scheduler configured with default profile"} {"level":"info","ts":1680276068.5309753,"logger":"setup","msg":"pods batch window","timeout":"1m0s","idle":"10s"} {"level":"info","ts":1680276068.5310457,"logger":"setup","msg":"starting manager"} {"level":"info","ts":1680276068.531701,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"} {"level":"info","ts":1680276068.5317254,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"127.0.0.1:8080"} I0331 15:21:08.632707 1 leaderelection.go:248] attempting to acquire leader lease nebuly-nos/gpu-partitioner.nebuly.com... 
I0331 15:21:25.357665 1 leaderelection.go:258] successfully acquired lease nebuly-nos/gpu-partitioner.nebuly.com {"level":"info","ts":1680276085.3579767,"msg":"Starting EventSource","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"} {"level":"info","ts":1680276085.3580549,"msg":"Starting Controller","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod"} {"level":"info","ts":1680276085.358093,"msg":"Starting EventSource","controller":"clusterstate-pod-controller","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"} {"level":"info","ts":1680276085.3581657,"msg":"Starting Controller","controller":"clusterstate-pod-controller","controllerGroup":"","controllerKind":"Pod"} {"level":"info","ts":1680276085.3579955,"msg":"Starting EventSource","controller":"clusterstate-node-controller","controllerGroup":"","controllerKind":"Node","source":"kind source: *v1.Node"} {"level":"info","ts":1680276085.3582487,"msg":"Starting Controller","controller":"clusterstate-node-controller","controllerGroup":"","controllerKind":"Node"} {"level":"info","ts":1680276085.3583376,"msg":"Starting EventSource","controller":"mps-partitioner-controller","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"} {"level":"info","ts":1680276085.3583982,"msg":"Starting Controller","controller":"mps-partitioner-controller","controllerGroup":"","controllerKind":"Pod"} {"level":"info","ts":1680276085.4595256,"msg":"Starting workers","controller":"mps-partitioner-controller","controllerGroup":"","controllerKind":"Pod","worker count":1} {"level":"info","ts":1680276085.4595923,"msg":"Starting workers","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","worker count":1} {"level":"info","ts":1680276085.4596114,"msg":"Starting workers","controller":"clusterstate-pod-controller","controllerGroup":"","controllerKind":"Pod","worker count":10} {"level":"info","ts":1680276085.4626248,"msg":"Starting workers","controller":"clusterstate-node-controller","controllerGroup":"","controllerKind":"Node","worker count":10} {"level":"info","ts":1680276095.8118343,"msg":"processing pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142"} {"level":"info","ts":1680276095.8125858,"msg":"found 5 pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142"} {"level":"info","ts":1680276095.8126106,"msg":"5 out of 5 pending pods could be helped","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142"} {"level":"info","ts":1680276095.8137965,"msg":"computed desired partitioning 
state","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142","partitioning":{"DesiredState":{"abc.xyz.com":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":7}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}}}} {"level":"info","ts":1680276095.8139107,"msg":"applying desired partitioning","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142"} {"level":"info","ts":1680276095.8140306,"msg":"partitioning node","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142","node":"abc.xyz.com","partitioning":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":7}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}} {"level":"info","ts":1680276095.8921742,"msg":"plan applied","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14105-1230399844","namespace":"likku"},"namespace":"likku","name":"exec-dev-14105-1230399844","reconcileID":"60f568ec-73bb-4ad3-a4f3-31280c16e142"} {"level":"info","ts":1680276130.7219198,"msg":"processing pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81"} {"level":"info","ts":1680276130.722643,"msg":"found 5 pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81"} {"level":"info","ts":1680276130.722669,"msg":"5 out of 5 pending pods could be helped","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81"} {"level":"info","ts":1680276130.7264588,"msg":"computed desired partitioning 
state","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81","partitioning":{"DesiredState":{"abc.xyz.com":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":5,"nvidia.com/mig-2g.20gb":1}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}}}} {"level":"info","ts":1680276130.7265663,"msg":"applying desired partitioning","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81"} {"level":"info","ts":1680276130.7267036,"msg":"partitioning node","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81","node":"abc.xyz.com","partitioning":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":5,"nvidia.com/mig-2g.20gb":1}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}} {"level":"info","ts":1680276130.8019152,"msg":"plan applied","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"deployment-1-likku-7df94448b4-8p6vf","namespace":"gpu-operator-resources"},"namespace":"gpu-operator-resources","name":"deployment-1-likku-7df94448b4-8p6vf","reconcileID":"e80ccb43-dcca-4c20-a751-d1cfb65ffb81"} {"level":"info","ts":1680276145.7220917,"msg":"processing pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad"} {"level":"info","ts":1680276145.722748,"msg":"found 5 pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad"} {"level":"info","ts":1680276145.7227693,"msg":"5 out of 5 pending pods could be helped","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad"} {"level":"info","ts":1680276145.7237008,"msg":"computed desired partitioning 
state","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad","partitioning":{"DesiredState":{"abc.xyz.com":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":3,"nvidia.com/mig-2g.20gb":2}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}}}} {"level":"info","ts":1680276145.7237358,"msg":"applying desired partitioning","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad"} {"level":"info","ts":1680276145.7238317,"msg":"partitioning node","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad","node":"abc.xyz.com","partitioning":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":3,"nvidia.com/mig-2g.20gb":2}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}} {"level":"info","ts":1680276145.7997258,"msg":"plan applied","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"exec-dev-14109-2768846280","namespace":"likku"},"namespace":"likku","name":"exec-dev-14109-2768846280","reconcileID":"881671e9-93e1-47ac-836f-d0776fb398ad"} {"level":"info","ts":1680276180.721883,"msg":"processing pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84"} {"level":"info","ts":1680276180.7226257,"msg":"found 5 pending pods","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84"} {"level":"info","ts":1680276180.7226565,"msg":"5 out of 5 pending pods could be helped","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84"} {"level":"info","ts":1680276180.723646,"msg":"computed desired partitioning 
state","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84","partitioning":{"DesiredState":{"abc.xyz.com":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":5,"nvidia.com/mig-2g.20gb":1}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}}}} {"level":"info","ts":1680276180.723694,"msg":"applying desired partitioning","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84"} {"level":"info","ts":1680276180.7238033,"msg":"partitioning node","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84","node":"abc.xyz.com","partitioning":{"GPUs":[{"GPUIndex":0,"Resources":{"nvidia.com/mig-1g.10gb":5,"nvidia.com/mig-2g.20gb":1}},{"GPUIndex":1,"Resources":{}},{"GPUIndex":2,"Resources":{}},{"GPUIndex":3,"Resources":{}},{"GPUIndex":4,"Resources":{}},{"GPUIndex":5,"Resources":{}},{"GPUIndex":6,"Resources":{}},{"GPUIndex":7,"Resources":{}}]}} {"level":"info","ts":1680276180.798511,"msg":"plan applied","controller":"mig-partitioner-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"prometheus-prometheus-kube-prometheus-prometheus-0","namespace":"prometheus"},"namespace":"prometheus","name":"prometheus-prometheus-kube-prometheus-prometheus-0","reconcileID":"84e179f8-1db7-45a4-94de-04720538fd84"}

kubectl describe pod nebuly-nos-nebuly-nos-gpu-partitioner-7c6bd668b5-jcc5p
Name: nebuly-nos-nebuly-nos-gpu-partitioner-7c6bd668b5-jcc5p
Namespace: nebuly-nos
Priority: 0
Node: abc.xyz.com/10.10.14.35
Start Time: Fri, 31 Mar 2023 15:21:05 +0000
Labels: app.kubernetes.io/instance=nebuly-nos-nebuly-nos
app.kubernetes.io/name=nos-gpu-partitioner
control-plane=nos-controller-manager
pod-template-hash=7c6bd668b5
Annotations: kubectl.kubernetes.io/default-container: nos
kubernetes.io/psp: psp-no-restriction
Status: Running
IP: 100.96.170.248
IPs:
IP: 100.96.170.248
Controlled By: ReplicaSet/nebuly-nos-nebuly-nos-gpu-partitioner-7c6bd668b5
Containers:
nos:
Container ID: containerd://54d122b9d07ccc1b1fee00fe51e309900b8450a98ae36436413b9e5011d72dc1
Image: ghcr.io/nebuly-ai/nos-gpu-partitioner:latest
Image ID: ghcr.io/nebuly-ai/nos-gpu-partitioner@sha256:31a8754a14dd709ff3aa81fd9cd0ef8378438a4c068a73d8449bb0c77bc0c65f
Port:
Host Port:
Command:
/gpupartitioner
Args:
--config=gpu_partitioner_config.yaml
State: Running
Started: Fri, 31 Mar 2023 15:21:07 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 128Mi
Requests:
cpu: 10m
memory: 64Mi
Liveness: http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/gpu_partitioner_config.yaml from gpu-partitioner-config (rw,path="gpu_partitioner_config.yaml")
/known_mig_geometries.yaml from known-mig-geometries (rw,path="known_mig_geometries.yaml")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bl8ng (ro)
kube-rbac-proxy:
Container ID: containerd://e9ad42a15c2c202fb5e52329a77bb22a596f785822daf3d25881c535cb43975d
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.0
Image ID: gcr.io/kubebuilder/kube-rbac-proxy@sha256:d99a8d144816b951a67648c12c0b988936ccd25cf3754f3cd85ab8c01592248f
Port: 8443/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:8443
--upstream=http://127.0.0.1:8080/
--logtostderr=true
--v=1
State: Running
Started: Fri, 31 Mar 2023 15:21:07 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 128Mi
Requests:
cpu: 5m
memory: 64Mi
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bl8ng (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
gpu-partitioner-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nebuly-nos-nebuly-nos-gpu-partitioner-config
Optional: false
known-mig-geometries:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nebuly-nos-nebuly-nos-gpu-partitioner-known-mig-geometries
Optional: false
kube-api-access-bl8ng:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Burstable
Node-Selectors:
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message


Normal Scheduled 13m default-scheduler Successfully assigned nebuly-nos/nebuly-nos-nebuly-nos-gpu-partitioner-7c6bd668b5-jcc5p to lxjh820.phibred.com
Normal Pulled 13m kubelet Container image "ghcr.io/nebuly-ai/nos-gpu-partitioner:latest" already present on machine
Normal Created 13m kubelet Created container nos
Normal Started 13m kubelet Started container nos
Normal Pulled 13m kubelet Container image "gcr.io/kubebuilder/kube-rbac-proxy:v0.13.0" already present on machine
Normal Created 13m kubelet Created container kube-rbac-proxy
Normal Started 13m kubelet Started container kube-rbac-proxy
Please let me know if any more info is needed.

Thank you

Hi @likku123, from the GPU Partitioner image SHA it looks like nos is still using the old version. The SHA of the new image is: sha256:9c372f7e1ee8478e8da72c919e352ed2a17f74bb60f368e303bc5ec50c889eba.

To pull the latest image, you can either delete the existing image ghcr.io/nebuly-ai/nos-gpu-partitioner:latest from your node or provide the value gpuPartitioner.image.pullPolicy: Always to the nos Helm chart.
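For reference, the corresponding section of your HelmRelease values would look roughly like this (just a sketch, assuming the chart exposes the standard image.pullPolicy value):

gpuPartitioner:
  image:
    tag: latest
    pullPolicy: Always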

Hope this helps!

Yipeee!!
At last it is working and I am able to spin up the pod.
Thanks a lot @Telemaco019
I am closing this issue for now and will open a new one if I face any further issues.
Thank you.