[BUG] Dataload not ready
zhujian7 opened this issue
What is your environment? (Kubernetes version, Fluid version, etc.)
Kubernetes: GKE v1.29.5-gke.1091002, with 3 nodes
Fluid version: 1.0.0
Describe the bug
After the Dataset and AlluxioRuntime become ready, the DataLoad never gets any status (no phase, no duration).
Dataset
╰─# k get dataset model-data
NAME UFS TOTAL SIZE CACHED CACHE CAPACITY CACHED PERCENTAGE PHASE AGE
model-data 28.78GiB 0.00B 10.00GiB 0.0% Bound 4h56m
╰─# k get dataset model-data -oyaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
creationTimestamp: "2024-07-12T09:35:36Z"
finalizers:
- fluid-dataset-controller-finalizer
generation: 1
name: model-data
namespace: default
ownerReferences:
- apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
name: 8ab7d8e23369e21ae148a8da41c781120946544f13f41513d41feab82c26a66a-my-model-dataset
uid: 639a4e72-0f81-496d-a1f1-eb7c0f27a49b
resourceVersion: "15696"
uid: 3d4c37d5-eedf-4567-af83-b59ac38f2a41
spec:
mounts:
- encryptOptions:
- name: aws.accessKeyId
valueFrom:
secretKeyRef:
key: aws.accessKeyId
name: access-key
- name: aws.secretKey
valueFrom:
secretKeyRef:
key: aws.secretKey
name: access-key
mountPoint: s3://zj-models
name: models
options:
alluxio.underfs.s3.endpoint: s3.amazonaws.com
alluxio.underfs.s3.region: us-east-1
path: /
status:
cacheStates:
cacheCapacity: 10.00GiB
cacheHitRatio: 0.0%
cacheThroughputRatio: 0.0%
cached: 0.00B
cachedPercentage: 0.0%
localHitRatio: 0.0%
localThroughputRatio: 0.0%
remoteHitRatio: 0.0%
remoteThroughputRatio: 0.0%
conditions:
- lastTransitionTime: "2024-07-12T09:38:41Z"
lastUpdateTime: "2024-07-12T09:38:41Z"
message: The ddc runtime is ready.
reason: DatasetReady
status: "True"
type: Ready
fileNum: "131"
hcfs:
endpoint: alluxio://model-data-master-0.default:22136
underlayerFileSystemVersion: 3.3.1
mounts:
- encryptOptions:
- name: aws.accessKeyId
valueFrom:
secretKeyRef:
key: aws.accessKeyId
name: access-key
- name: aws.secretKey
valueFrom:
secretKeyRef:
key: aws.secretKey
name: access-key
mountPoint: s3://zj-models
name: models
options:
alluxio.underfs.s3.endpoint: s3.amazonaws.com
alluxio.underfs.s3.region: us-east-1
path: /
phase: Bound
runtimes:
- category: Accelerate
name: model-data
namespace: default
type: alluxio
ufsTotal: 28.78GiB
AlluxioRuntime
╰─# k get alluxioruntime model-data
NAME MASTER PHASE WORKER PHASE FUSE PHASE AGE
model-data Ready Ready Ready 4h58m
╰─# k get alluxioruntime model-data -oyaml
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
creationTimestamp: "2024-07-12T09:35:26Z"
finalizers:
- alluxio-runtime-controller-finalizer
generation: 4
name: model-data
namespace: default
ownerReferences:
- apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
name: 8ab7d8e23369e21ae148a8da41c781120946544f13f41513d41feab82c26a66a-my-model-dataset
uid: 639a4e72-0f81-496d-a1f1-eb7c0f27a49b
- apiVersion: data.fluid.io/v1alpha1
kind: Dataset
name: model-data
uid: 3d4c37d5-eedf-4567-af83-b59ac38f2a41
resourceVersion: "15694"
uid: 1de43c9b-c36f-4d95-9872-6979481bd29e
spec:
replicas: 1
tieredstore:
levels:
- high: "0.99"
low: "0.99"
mediumtype: MEM
path: /dev/shm
quota: 10Gi
volumeType: hostPath
status:
cacheAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: fluid.io/f-default-model-data
operator: In
values:
- "true"
weight: 100
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms: []
cacheStates:
cacheCapacity: 10.00GiB
cacheHitRatio: 0.0%
cacheThroughputRatio: 0.0%
cached: 0.00B
cachedPercentage: 0.0%
localHitRatio: 0.0%
localThroughputRatio: 0.0%
remoteHitRatio: 0.0%
remoteThroughputRatio: 0.0%
conditions:
- lastProbeTime: "2024-07-12T09:35:56Z"
lastTransitionTime: "2024-07-12T09:35:56Z"
message: The master is initialized.
reason: Master is initialized
status: "True"
type: MasterInitialized
- lastProbeTime: "2024-07-12T09:38:36Z"
lastTransitionTime: "2024-07-12T09:37:36Z"
message: The master is ready.
reason: Master is ready
status: "True"
type: MasterReady
- lastProbeTime: "2024-07-12T09:37:36Z"
lastTransitionTime: "2024-07-12T09:37:36Z"
message: The workers are initialized.
reason: Workers are initialized
status: "True"
type: WorkersInitialized
- lastProbeTime: "2024-07-12T09:38:36Z"
lastTransitionTime: "2024-07-12T09:38:36Z"
message: The workers are ready.
reason: Workers are ready
status: "True"
type: WorkersReady
- lastProbeTime: "2024-07-12T09:38:42Z"
lastTransitionTime: "2024-07-12T09:38:42Z"
message: The fuses are ready
reason: The Fuses are ready.
status: "False"
type: FusesReady
currentFuseNumberScheduled: 0
currentMasterNumberScheduled: 1
currentWorkerNumberScheduled: 1
desiredFuseNumberScheduled: 0
desiredMasterNumberScheduled: 1
desiredWorkerNumberScheduled: 1
fuseNumberReady: 0
fusePhase: Ready
masterNumberReady: 1
masterPhase: Ready
selector: app=alluxio,release=model-data,role=alluxio-worker
setupDuration: 3m16s
valueFile: model-data-alluxio-values
workerNumberAvailable: 1
workerNumberReady: 1
workerPhase: Ready
This condition of the AlluxioRuntime does not seem to make sense: the message and reason say the fuses are ready, yet the status is "False":
- lastProbeTime: "2024-07-12T09:38:42Z"
lastTransitionTime: "2024-07-12T09:38:42Z"
message: The fuses are ready
reason: The Fuses are ready.
status: "False"
type: FusesReady
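From the status above, desiredFuseNumberScheduled and fuseNumberReady are both 0 while fusePhase is Ready, so the "False" condition with a "ready" message looks inconsistent. To pull just these fields side by side, a plain kubectl jsonpath query works (nothing Fluid-specific):
╰─# k get alluxioruntime model-data -o jsonpath='{.status.fusePhase} {.status.desiredFuseNumberScheduled} {.status.fuseNumberReady} {.status.conditions[?(@.type=="FusesReady")].status}{"\n"}'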
DataLoad
╰─# k get dataloads model-dataload
NAME DATASET PHASE AGE DURATION
model-dataload model-data 4h58m
╰─# k get dataloads model-dataload -oyaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
creationTimestamp: "2024-07-12T09:35:30Z"
generation: 1
name: model-dataload
namespace: default
ownerReferences:
- apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
name: 8ab7d8e23369e21ae148a8da41c781120946544f13f41513d41feab82c26a66a-my-dataload
uid: 856a47f0-974f-481f-ae65-74cb76a26b63
resourceVersion: "11794"
uid: 118a4338-a77f-4a12-bd35-03e2cec37dc3
spec:
dataset:
name: model-data
namespace: default
loadMetadata: true
policy: Once
target:
- path: /Qwen1.5-7B-Chat
replicas: 1
What you expect to happen:
The DataLoad should have a status (phase, conditions, duration).
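For reference, once the controller processes a DataLoad, its status is normally populated along these lines (an illustrative sketch with made-up values, not output from this cluster):
status:
  conditions:
  - lastProbeTime: "2024-07-12T09:40:00Z"        # illustrative timestamps
    lastTransitionTime: "2024-07-12T09:40:00Z"
    status: "True"
    type: Complete
  duration: 40s                                  # illustrative
  phase: Complete
Here the status block is missing entirely, so the DataLoad controller never seems to have reconciled the object.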
How to reproduce it
Create the three resources above on a GKE cluster (manifests reconstructed below).
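For convenience, the three manifests reconstructed from the spec sections above (owner references and status stripped; the referenced access-key Secret is assumed to already exist):
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: model-data
  namespace: default
spec:
  mounts:
  - encryptOptions:              # S3 credentials taken from the access-key Secret
    - name: aws.accessKeyId
      valueFrom:
        secretKeyRef:
          key: aws.accessKeyId
          name: access-key
    - name: aws.secretKey
      valueFrom:
        secretKeyRef:
          key: aws.secretKey
          name: access-key
    mountPoint: s3://zj-models
    name: models
    options:
      alluxio.underfs.s3.endpoint: s3.amazonaws.com
      alluxio.underfs.s3.region: us-east-1
    path: /
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: model-data
  namespace: default
spec:
  replicas: 1
  tieredstore:
    levels:
    - high: "0.99"
      low: "0.99"
      mediumtype: MEM            # 10Gi memory cache on /dev/shm
      path: /dev/shm
      quota: 10Gi
      volumeType: hostPath
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: model-dataload
  namespace: default
spec:
  dataset:
    name: model-data
    namespace: default
  loadMetadata: true
  policy: Once
  target:
  - path: /Qwen1.5-7B-Chat
    replicas: 1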
Additional Information
daemonset status
╰─# k get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
model-data-fuse 0 0 0 0 0 fluid.io/f-default-model-data=true 5h2m
╰─# k describe ds model-data-fuse
Name: model-data-fuse
Selector: app=alluxio,chart=alluxio-0.9.13,heritage=Helm,release=model-data,role=alluxio-fuse
Node-Selector: fluid.io/f-default-model-data=true
Labels: app=alluxio
app.kubernetes.io/instance=model-data
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=alluxio
chart=alluxio-0.9.13
fluid.io/managed-by=fluid
heritage=Helm
release=model-data
role=alluxio-fuse
Annotations: deprecated.daemonset.template.generation: 1
meta.helm.sh/release-name: model-data
meta.helm.sh/release-namespace: default
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=alluxio
chart=alluxio-0.9.13
heritage=Helm
release=model-data
role=alluxio-fuse
Annotations: sidecar.istio.io/inject: false
Containers:
alluxio-fuse:
Image: alluxio/alluxio-dev:2.9.0
Port: <none>
Host Port: <none>
Command:
/entrypoint.sh
Args:
fuse
--fuse-opts=kernel_cache,ro,attr_timeout=7200,entry_timeout=7200,allow_other
/tmp/runtime-mnt/alluxio/default/model-data/alluxio-fuse
/
Environment Variables from:
model-data-config ConfigMap Optional: false
Environment:
ALLUXIO_CLIENT_HOSTNAME: (v1:status.hostIP)
ALLUXIO_CLIENT_JAVA_OPTS: -Dalluxio.user.hostname=${ALLUXIO_CLIENT_HOSTNAME}
FLUID_RUNTIME_TYPE: alluxio
FLUID_RUNTIME_NS: default
FLUID_RUNTIME_NAME: model-data
Mounts:
/dev/fuse from alluxio-fuse-device (rw)
/dev/shm/default/model-data from mem (rw)
/tmp/runtime-mnt/alluxio/default/model-data from alluxio-fuse-mount (rw)
Volumes:
alluxio-fuse-device:
Type: HostPath (bare host directory volume)
Path: /dev/fuse
HostPathType: CharDevice
alluxio-fuse-mount:
Type: HostPath (bare host directory volume)
Path: /tmp/runtime-mnt/alluxio/default/model-data
HostPathType: DirectoryOrCreate
mem:
Type: HostPath (bare host directory volume)
Path: /dev/shm/default/model-data
HostPathType: DirectoryOrCreate
Events: <none>
csi-nodeplugin logs
╰─# k logs -f -n fluid-system csi-nodeplugin-fluid-hj5q6 -c plugins
+ check_kubelet_rootdir_subfolder /var/lib/kubelet/pods
+ local dir=/var/lib/kubelet/pods
+ '[' '!' -d /var/lib/kubelet/pods ']'
+ check_kubelet_rootdir_subfolder /var/lib/kubelet/plugins
+ local dir=/var/lib/kubelet/plugins
+ '[' '!' -d /var/lib/kubelet/plugins ']'
+ rm -f /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock
+ mkdir -p /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io
+ fluid-csi start --nodeid=gke-zj-cluster-1-default-pool-e44ebc24-4qhp --endpoint=unix:///var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock --v=5 --feature-gates=FuseRecovery=false --prune-fs=fuse.alluxio-fuse,fuse.jindofs-fuse,fuse.juicefs,fuse.goosefs-fuse,ossfs,alifuse.aliyun-alinas-efc '--prune-path="/tmp/runtime-mnt"' --pprof-addr=:6060 --kubelet-kube-config=/var/lib/kubelet/kubeconfig
2024/07/12 17:35:18 BuildDate: 2024-04-16_03:40:03
2024/07/12 17:35:18 GitCommit: 31f54333fc3f0c62d1826fa3a2209e83684a4617
2024/07/12 17:35:18 GitTreeState: clean
2024/07/12 17:35:18 GoVersion: go1.18.10
2024/07/12 17:35:18 Compiler: gc
2024/07/12 17:35:18 Platform: linux/amd64
I0712 17:35:18.539558 9 csi.go:135] Enabling pprof with address :6060
I0712 17:35:18.539582 9 csi.go:147] Starting pprof HTTP server at :6060
I0712 17:35:19.244994 9 register.go:48] Registering plugins to controller manager
I0712 17:35:19.247477 9 driver.go:54] Driver: fuse.csi.fluid.io version: 1.0.0
I0712 17:35:19.247523 9 driver.go:57] protocol: unix addr: /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock
I0712 17:35:19.247549 9 driver.go:81] Enabling controller service capability: CREATE_DELETE_VOLUME
I0712 17:35:19.247559 9 driver.go:93] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0712 17:35:19.247594 9 register.go:54] recover is not enabled
I0712 17:35:19.247602 9 register.go:48] Registering updatedbconf to controller manager
I0712 17:35:19.247647 9 register.go:37] backup old /etc/updatedb.conf to /etc/updatedb.conf.backup
I0712 17:35:19.247766 9 register.go:43] backup complete, now update /etc/updatedb.conf
I0712 17:35:19.248427 9 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"//var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock", Net:"unix"}
I0712 17:35:19.946395 9 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginInfo
I0712 17:35:19.946422 9 utils.go:98] GRPC request: {}
I0712 17:35:19.948840 9 identityserver-default.go:32] Using default GetPluginInfo
I0712 17:35:19.948851 9 utils.go:103] GRPC response: {"name":"fuse.csi.fluid.io","vendor_version":"1.0.0"}
I0712 17:35:20.787750 9 utils.go:97] GRPC call: /csi.v1.Node/NodeGetInfo
I0712 17:35:20.787778 9 utils.go:98] GRPC request: {}
I0712 17:35:20.787863 9 nodeserver-default.go:40] Using default NodeGetInfo
I0712 17:35:20.787871 9 utils.go:103] GRPC response: {"node_id":"gke-zj-cluster-1-default-pool-e44ebc24-4qhp"}
╰─# k logs -f -n fluid-system csi-nodeplugin-fluid-kjcqz -c plugins
+ check_kubelet_rootdir_subfolder /var/lib/kubelet/pods
+ local dir=/var/lib/kubelet/pods
+ '[' '!' -d /var/lib/kubelet/pods ']'
+ check_kubelet_rootdir_subfolder /var/lib/kubelet/plugins
+ local dir=/var/lib/kubelet/plugins
+ '[' '!' -d /var/lib/kubelet/plugins ']'
+ rm -f /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock
+ mkdir -p /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io
+ fluid-csi start --nodeid=gke-zj-cluster-1-default-pool-e44ebc24-2245 --endpoint=unix:///var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock --v=5 --feature-gates=FuseRecovery=false --prune-fs=fuse.alluxio-fuse,fuse.jindofs-fuse,fuse.juicefs,fuse.goosefs-fuse,ossfs,alifuse.aliyun-alinas-efc '--prune-path="/tmp/runtime-mnt"' --pprof-addr=:6060 --kubelet-kube-config=/var/lib/kubelet/kubeconfig
2024/07/12 17:35:18 BuildDate: 2024-04-16_03:40:03
2024/07/12 17:35:18 GitCommit: 31f54333fc3f0c62d1826fa3a2209e83684a4617
2024/07/12 17:35:18 GitTreeState: clean
2024/07/12 17:35:18 GoVersion: go1.18.10
2024/07/12 17:35:18 Compiler: gc
2024/07/12 17:35:18 Platform: linux/amd64
I0712 17:35:18.558063 9 csi.go:135] Enabling pprof with address :6060
I0712 17:35:18.558084 9 csi.go:147] Starting pprof HTTP server at :6060
I0712 17:35:19.266734 9 register.go:48] Registering plugins to controller manager
I0712 17:35:19.268770 9 driver.go:54] Driver: fuse.csi.fluid.io version: 1.0.0
I0712 17:35:19.268842 9 driver.go:57] protocol: unix addr: /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock
I0712 17:35:19.268869 9 driver.go:81] Enabling controller service capability: CREATE_DELETE_VOLUME
I0712 17:35:19.268881 9 driver.go:93] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0712 17:35:19.268936 9 register.go:54] recover is not enabled
I0712 17:35:19.268955 9 register.go:48] Registering updatedbconf to controller manager
I0712 17:35:19.269048 9 register.go:37] backup old /etc/updatedb.conf to /etc/updatedb.conf.backup
I0712 17:35:19.269165 9 register.go:43] backup complete, now update /etc/updatedb.conf
I0712 17:35:19.270073 9 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"//var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock", Net:"unix"}
I0712 17:35:19.755443 9 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginInfo
I0712 17:35:19.755475 9 utils.go:98] GRPC request: {}
I0712 17:35:19.758483 9 identityserver-default.go:32] Using default GetPluginInfo
I0712 17:35:19.758499 9 utils.go:103] GRPC response: {"name":"fuse.csi.fluid.io","vendor_version":"1.0.0"}
I0712 17:35:20.478187 9 utils.go:97] GRPC call: /csi.v1.Node/NodeGetInfo
I0712 17:35:20.478218 9 utils.go:98] GRPC request: {}
I0712 17:35:20.478329 9 nodeserver-default.go:40] Using default NodeGetInfo
I0712 17:35:20.478346 9 utils.go:103] GRPC response: {"node_id":"gke-zj-cluster-1-default-pool-e44ebc24-2245"}
╰─# k logs -f -n fluid-system csi-nodeplugin-fluid-xp77b -c plugins
+ check_kubelet_rootdir_subfolder /var/lib/kubelet/pods
+ local dir=/var/lib/kubelet/pods
+ '[' '!' -d /var/lib/kubelet/pods ']'
+ check_kubelet_rootdir_subfolder /var/lib/kubelet/plugins
+ local dir=/var/lib/kubelet/plugins
+ '[' '!' -d /var/lib/kubelet/plugins ']'
+ rm -f /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock
+ mkdir -p /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io
+ fluid-csi start --nodeid=gke-zj-cluster-1-default-pool-e44ebc24-5qhd --endpoint=unix:///var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock --v=5 --feature-gates=FuseRecovery=false --prune-fs=fuse.alluxio-fuse,fuse.jindofs-fuse,fuse.juicefs,fuse.goosefs-fuse,ossfs,alifuse.aliyun-alinas-efc '--prune-path="/tmp/runtime-mnt"' --pprof-addr=:6060 --kubelet-kube-config=/var/lib/kubelet/kubeconfig
2024/07/12 17:35:19 BuildDate: 2024-04-16_03:40:03
2024/07/12 17:35:19 GitCommit: 31f54333fc3f0c62d1826fa3a2209e83684a4617
2024/07/12 17:35:19 GitTreeState: clean
2024/07/12 17:35:19 GoVersion: go1.18.10
2024/07/12 17:35:19 Compiler: gc
2024/07/12 17:35:19 Platform: linux/amd64
I0712 17:35:19.526761 9 csi.go:135] Enabling pprof with address :6060
I0712 17:35:19.526794 9 csi.go:147] Starting pprof HTTP server at :6060
I0712 17:35:20.265849 9 register.go:54] recover is not enabled
I0712 17:35:20.265880 9 register.go:48] Registering updatedbconf to controller manager
I0712 17:35:20.265973 9 register.go:37] backup old /etc/updatedb.conf to /etc/updatedb.conf.backup
I0712 17:35:20.266082 9 register.go:43] backup complete, now update /etc/updatedb.conf
I0712 17:35:20.266140 9 register.go:48] Registering plugins to controller manager
I0712 17:35:20.269187 9 driver.go:54] Driver: fuse.csi.fluid.io version: 1.0.0
I0712 17:35:20.269286 9 driver.go:57] protocol: unix addr: /var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock
I0712 17:35:20.269327 9 driver.go:81] Enabling controller service capability: CREATE_DELETE_VOLUME
I0712 17:35:20.269338 9 driver.go:93] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0712 17:35:20.270065 9 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"//var/lib/kubelet/csi-plugins/fuse.csi.fluid.io/csi.sock", Net:"unix"}
I0712 17:35:20.620948 9 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginInfo
I0712 17:35:20.620979 9 utils.go:98] GRPC request: {}
I0712 17:35:20.626132 9 identityserver-default.go:32] Using default GetPluginInfo
I0712 17:35:20.626158 9 utils.go:103] GRPC response: {"name":"fuse.csi.fluid.io","vendor_version":"1.0.0"}
I0712 17:35:21.452422 9 utils.go:97] GRPC call: /csi.v1.Node/NodeGetInfo
I0712 17:35:21.452450 9 utils.go:98] GRPC request: {}
I0712 17:35:21.452524 9 nodeserver-default.go:40] Using default NodeGetInfo
I0712 17:35:21.452559 9 utils.go:103] GRPC response: {"node_id":"gke-zj-cluster-1-default-pool-e44ebc24-5qhd"}
model-data worker logs
╰─# k get pod
NAME READY STATUS RESTARTS AGE
model-data-master-0 2/2 Running 0 5h5m
model-data-worker-0 2/2 Running 0 5h3m
╰─# k logs -f model-data-worker-0
Defaulted container "alluxio-worker" out of: alluxio-worker, alluxio-job-worker
Exception in thread "main" java.lang.RuntimeException: Invalid property key ALLUXIO_CLIENT_HOSTNAME
at alluxio.conf.InstancedConfiguration.lookupRecursively(InstancedConfiguration.java:441)
at alluxio.conf.InstancedConfiguration.lookup(InstancedConfiguration.java:412)
at alluxio.conf.InstancedConfiguration.isResolvable(InstancedConfiguration.java:152)
at alluxio.conf.InstancedConfiguration.isSet(InstancedConfiguration.java:162)
at alluxio.conf.AlluxioConfiguration.getOrDefault(AlluxioConfiguration.java:65)
at alluxio.cli.GetConf.getConfImpl(GetConf.java:189)
at alluxio.cli.GetConf.getConf(GetConf.java:146)
at alluxio.cli.GetConf.main(GetConf.java:267)
2024-07-12 09:38:27,017 INFO NettyUtils - EPOLL_MODE is available
2024-07-12 09:38:27,760 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.128.15.204, rack=null)
2024-07-12 09:38:27,792 INFO BlockWorkerFactory - Creating DefaultBlockWorker
2024-07-12 09:38:27,863 INFO DefaultStorageDir - Folder /dev/shm/default/model-data/alluxioworker was created!
2024-07-12 09:38:27,864 INFO DefaultStorageDir - StorageDir initialized: path=/dev/shm/default/model-data/alluxioworker, tier=alluxio.worker.block.meta.DefaultStorageTier@56b3afbe, dirIndex=0, medium=MEM, capacityBytes=10737418240, reservedBytes=0, availableBytes=10737418240
2024-07-12 09:38:27,956 WARN DefaultStorageTier - Failed to verify memory capacity
2024-07-12 09:38:28,075 INFO MetricsSystem - Starting sinks with config: {}.
2024-07-12 09:38:28,078 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-4651482606915682868. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2024-07-12 09:38:28,647 INFO log - Logging initialized @2645ms to org.eclipse.jetty.util.log.Slf4jLog
2024-07-12 09:38:28,726 INFO MetricsHeartbeatContext - Created metrics heartbeat with ID app-866709224198499182. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2024-07-12 09:38:29,443 INFO GrpcDataServer - Alluxio worker gRPC server started, listening on /0.0.0.0:25691
2024-07-12 09:38:29,445 INFO ProcessUtils - Starting Alluxio worker @10.128.15.204:25691.
2024-07-12 09:38:29,445 INFO ProcessUtils - Alluxio version: 2.9.0-d5919d8d80ae7bfdd914ade30620d5ca14f3b67e
2024-07-12 09:38:29,446 INFO ProcessUtils - Java version: 1.8.0_352
2024-07-12 09:38:29,782 INFO BlockMapIterator - Worker register stream batchSize=1000000
2024-07-12 09:38:29,983 INFO RegisterStreamer - Worker 7206219353394242556 - All requests have been sent. Completing the client side.
2024-07-12 09:38:29,989 INFO RegisterStreamer - Worker 7206219353394242556 - Waiting on the master side to complete
2024-07-12 09:38:30,014 INFO RegisterStreamer - 7206219353394242556 - Complete message received from the server. Closing stream
2024-07-12 09:38:30,018 INFO RegisterStreamer - Worker 7206219353394242556 - Finished registration with a stream
2024-07-12 09:38:30,037 INFO WebServer - Alluxio worker web service starting @ /0.0.0.0:20964
2024-07-12 09:38:30,039 INFO Server - jetty-9.4.46.v20220331; built: 2022-03-31T16:38:08.030Z; git: bc17a0369a11ecf40bb92c839b9ef0a8ac50ea18; jvm 1.8.0_352-b08
2024-07-12 09:38:30,232 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@6d23017e{/metrics/json,null,AVAILABLE}
2024-07-12 09:38:30,234 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@54dcfa5a{/metrics/prometheus,null,AVAILABLE}
2024-07-12 09:38:30,237 WARN SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@340b9973{/,file:///opt/alluxio-2.9.0/webui/worker/build/,STARTING} has uncovered http methods for path: /
2024-07-12 09:38:42,056 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@340b9973{/,file:///opt/alluxio-2.9.0/webui/worker/build/,AVAILABLE}
2024-07-12 09:38:42,069 INFO AbstractConnector - Started ServerConnector@10289886{HTTP/1.1, (http/1.1)}{0.0.0.0:20964}
2024-07-12 09:38:42,069 INFO Server - Started @16067ms
2024-07-12 09:38:42,069 INFO WebServer - Alluxio worker web service started @ /0.0.0.0:20964
2024-07-12 09:38:42,084 INFO AlluxioWorkerProcess - Alluxio worker started. id=7206219353394242556, bindHost=0.0.0.0, connectHost=10.128.15.204, rpcPort=25691, webPort=20964
cluster node info
╰─# k get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
gke-zj-cluster-1-default-pool-e44ebc24-2245 Ready <none> 5h23m v1.29.5-gke.1091002 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e2-standard-4,beta.kubernetes.io/os=linux,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=4,cloud.google.com/gke-logging-variant=DEFAULT,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-provisioning=standard,cloud.google.com/gke-stack-type=IPV4,cloud.google.com/machine-family=e2,cloud.google.com/private-node=false,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,fluid.io/dataset-num=1,fluid.io/s-alluxio-default-model-data=true,fluid.io/s-default-model-data=true,fluid.io/s-h-alluxio-m-default-model-data=10GiB,fluid.io/s-h-alluxio-t-default-model-data=10GiB,fluid_exclusive=default_model-data,kubernetes.io/arch=amd64,kubernetes.io/hostname=gke-zj-cluster-1-default-pool-e44ebc24-2245,kubernetes.io/os=linux,node.kubernetes.io/instance-type=e2-standard-4,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
gke-zj-cluster-1-default-pool-e44ebc24-4qhp Ready <none> 5h23m v1.29.5-gke.1091002 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e2-standard-4,beta.kubernetes.io/os=linux,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=4,cloud.google.com/gke-logging-variant=DEFAULT,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-provisioning=standard,cloud.google.com/gke-stack-type=IPV4,cloud.google.com/machine-family=e2,cloud.google.com/private-node=false,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=gke-zj-cluster-1-default-pool-e44ebc24-4qhp,kubernetes.io/os=linux,node.kubernetes.io/instance-type=e2-standard-4,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
gke-zj-cluster-1-default-pool-e44ebc24-5qhd Ready <none> 5h23m v1.29.5-gke.1091002 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=e2-standard-4,beta.kubernetes.io/os=linux,cloud.google.com/gke-boot-disk=pd-balanced,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-cpu-scaling-level=4,cloud.google.com/gke-logging-variant=DEFAULT,cloud.google.com/gke-max-pods-per-node=110,cloud.google.com/gke-nodepool=default-pool,cloud.google.com/gke-os-distribution=cos,cloud.google.com/gke-provisioning=standard,cloud.google.com/gke-stack-type=IPV4,cloud.google.com/machine-family=e2,cloud.google.com/private-node=false,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=gke-zj-cluster-1-default-pool-e44ebc24-5qhd,kubernetes.io/os=linux,node.kubernetes.io/instance-type=e2-standard-4,topology.gke.io/zone=us-central1-c,topology.kubernetes.io/region=us-central1,topology.kubernetes.io/zone=us-central1-c
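Note: only node ...-2245 carries any fluid.io labels, and only the scheduling ones (fluid.io/s-*); no node has the fuse DaemonSet's node selector label fluid.io/f-default-model-data=true, which matches the 0 desired pods for model-data-fuse above. A quick check:
╰─# k get node -l fluid.io/f-default-model-data=true
No resources found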
@zhujian7 Could you collect the info by following this doc: https://github.com/fluid-cloudnative/fluid/blob/master/docs/en/userguide/troubleshooting.md
/close
Closing, as the issue was not reproduced when I tried another time.