Faulty pod: describe output
[root@master1 ~]# kubectl -n kube-system describe pod ascend-device-plugin-ll46f
Name: ascend-device-plugin-ll46f
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: ascend-device-plugin-sa
Node: master1/10.17.30.131
Start Time: Mon, 30 Mar 2026 11:08:32 +0800
Labels: app.kubernetes.io/managed-by=npu-operator
controller-revision-hash=7df5dcb887
helm.sh/chart=npu-operator-0.15.0
name=ascend-device-plugin-ds
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: c1f2adcaeaaf2bdcf0a6e09730f68231a293074e31d58f61997f714dfb520878
cni.projectcalico.org/podIP: 192.168.137.118/32
cni.projectcalico.org/podIPs: 192.168.137.118/32
scheduler.alpha.kubernetes.io/critical-pod:
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 192.168.137.118
IPs:
IP: 192.168.137.118
Controlled By: DaemonSet/ascend-device-plugin
Init Containers:
init-permission:
Container ID: containerd://4406968a522bea48dfefebae81ec53644312762af4781c25de689952ed6c2d27
Image: cr.openfuyao.cn/openfuyao/busybox:1.36.1
Image ID: cr.openfuyao.cn/openfuyao/busybox@sha256:4b8407fadd8100c61b097d63efe992b2c033e7d371c9117f7a9462fe87e31176
Port: <none>
Host Port: <none>
Command:
sh
-c
chown 9000:9000 /var/log/mindx-dl /var/log/mindx-dl/devicePlugin
chmod 750 /var/log/mindx-dl/devicePlugin
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 30 Mar 2026 15:28:32 +0800
Finished: Mon, 30 Mar 2026 15:28:32 +0800
Ready: True
Restart Count: 1
Environment: <none>
Mounts:
/var/log/mindx-dl/devicePlugin from log-path (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Containers:
device-plugin-01:
Container ID: containerd://fcc0c4742285847e2621a9a9217502307fc7e28644fbf86b32f9c11d67a2c0ab
Image: cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0
Image ID: cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin@sha256:a5b9612b21bcd35384f9f19a05b2d7915b865e7b2be6a30bfd7806a9b8a86f58
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
--
Args:
device-plugin -useAscendDocker=true -volcanoType=false -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 31 Mar 2026 10:28:58 +0800
Finished: Tue, 31 Mar 2026 10:28:58 +0800
Ready: False
Restart Count: 274
Limits:
cpu: 500m
memory: 500Mi
Requests:
cpu: 500m
memory: 500Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/tmp from tmp (rw)
/usr/local/Ascend/driver from hiai-driver (ro)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/lib/kubelet/pod-resources from pod-resource (rw)
/var/log/mindx-dl/devicePlugin from log-path (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
pod-resource:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/pod-resources
HostPathType:
hiai-driver:
Type: HostPath (bare host directory volume)
Path: /usr/local/Ascend/driver
HostPathType:
log-path:
Type: HostPath (bare host directory volume)
Path: /var/log/mindx-dl/devicePlugin
HostPathType: DirectoryOrCreate
tmp:
Type: HostPath (bare host directory volume)
Path: /tmp
HostPathType:
kube-api-access-gfldg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: openfuyao.com/npu.present=
Tolerations: CriticalAddonsOnly op=Exists
device-plugin=v2:NoSchedule
huawei.com/Ascend910:NoSchedule op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 16m (x205 over 18h) kubelet (combined from similar events): Successfully pulled image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0" in 403ms (403ms including waiting). Image size: 48017174 bytes.
Warning BackOff 2m47s (x5216 over 18h) kubelet Back-off restarting failed container device-plugin-01 in pod ascend-device-plugin-ll46f_kube-system(8edcd384-ab2d-4998-8077-5ac58801c79e)
Normal Pulling 66s (x227 over 19h) kubelet Pulling image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0"

Faulty pod: /dev check
[root@master1 fuyao-26.3-rc3]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- ls /dev
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
autofs null tty10 tty34 tty58 vcs5
bsg ppp tty11 tty35 tty59 vcs6
btrfs-control ptmx tty12 tty36 tty6 vcsa
bus pts tty13 tty37 tty60 vcsa1
core random tty14 tty38 tty61 vcsa2
cpu_dma_latency raw tty15 tty39 tty62 vcsa3
cuse relationship_ctrl tty16 tty4 tty63 vcsa4
davinci0 rfkill tty17 tty40 tty7 vcsa5
davinci_manager rtc0 tty18 tty41 tty8 vcsa6
devmm_svm sda tty19 tty42 tty9 vcsu
dri sda1 tty2 tty43 ttyAMA0 vcsu1
fb0 sda2 tty20 tty44 ttyS0 vcsu2
fd sg0 tty21 tty45 ttyS1 vcsu3
full sg1 tty22 tty46 ttyS2 vcsu4
fuse sg2 tty23 tty47 ttyS3 vcsu5
hidraw0 shm tty24 tty48 uhid vcsu6
hidraw1 snapshot tty25 tty49 uinput vfio
hisi_hdc sr0 tty26 tty5 urandom vga_arbiter
hwrng sr1 tty27 tty50 usbmon0 vhost-net
input stderr tty28 tty51 usbmon1 vhost-vsock
kmsg stdin tty29 tty52 usbmon2 vport2p1
loop-control stdout tty3 tty53 vcs zero
mapper termination-log tty30 tty54 vcs1
mem tty tty31 tty55 vcs2
mqueue tty0 tty32 tty56 vcs3
net tty1 tty33 tty57 vcs4

Faulty pod: driver check
[root@master1 fuyao-26.3-rc3]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- ls -lha /usr/local/Ascend/driver
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
total 44K
drwxr-xr-x 8 root root 4.0K Mar 27 08:03 .
drwxr-xr-x 3 root root 4.0K Mar 31 02:34 ..
drwxr-xr-x 2 root root 4.0K Mar 27 08:01 bin
-r--r--r-- 1 root root 20 Mar 27 08:01 build.info
dr-xr-x--- 2 root root 4.0K Mar 27 08:01 device
dr-x------ 41 root root 4.0K Mar 27 08:01 kernel
drwxr-xr-x 6 root root 4.0K Mar 27 08:01 lib64
-r--r----- 1 root root 56 Mar 27 08:01 scene.info
dr-xr-x--- 2 root root 4.0K Mar 27 08:01 script
drwxr-xr-x 2 root root 4.0K Mar 27 08:01 tools
-r--r--r-- 1 root root 352 Mar 27 08:03 version.info

Faulty pod: logs
[root@master1 ~]# kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
[INFO] 2026/03/31 06:46:54.593254 1 hwlog/api.go:108 devicePlugin.log's logger init success
[INFO] 2026/03/31 06:46:54.593449 1 main.go:187 ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO] 2026/03/31 06:46:54.593494 1 main.go:188 ascend device plugin starting scene is center
[INFO] 2026/03/31 06:46:54.787930 1 devmanager/devmanager.go:104 the dcmi version is 24.1.rc3
[ERROR] 2026/03/31 06:46:54.788019 1 devmanager/devmanager.go:211 get error card quantity: 0
[ERROR] 2026/03/31 06:46:54.788052 1 devmanager/devmanager.go:195 get card list failed for init
[ERROR] 2026/03/31 06:46:54.788101 1 main.go:203 init devmanager failed, err: auto init failed, err: get card list failed for init

Faulty pod: driver check
[root@master1 ~]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- bash -c 'find /usr/local/Ascend/driver -name libdcmi.so 2>/dev/null; echo $LD_LIBRARY_PATH'
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
command terminated with exit code 137
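Exit code 137 from the `kubectl exec` above follows the usual shell convention of 128 plus a signal number, here 9 (SIGKILL). Given the CrashLoopBackOff state, a plausible reading is that the container stopped while the exec session was still running, which would also explain the missing `echo $LD_LIBRARY_PATH` output; this is an inference and not something the log confirms. A minimal sketch for decoding such codes, where the function name `decode_exit_code` is hypothetical:

```shell
# Decode a shell-style exit code; values above 128 conventionally mean
# "terminated by signal (code - 128)".
decode_exit_code() {
    code="$1"
    if [ "$code" -gt 128 ]; then
        signal_number=$((code - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        signal_name=$(kill -l "$signal_number")
        echo "exit code $code = 128 + $signal_number (SIG$signal_name)"
    else
        echo "exit code $code: not a signal exit"
    fi
}

decode_exit_code 137
```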
[root@master1 ~]# ps -ef | grep -E 'dmp_daemon|slogd' | grep -v grep
root 21578 1 0 Mar30 ? 00:00:19 /usr/sbin/rsyslogd -n -i/var/run/rsyslogd.pid

Check service status?
[root@master1 ~]# systemctl status ascend-dmi
Unit ascend-dmi.service could not be found.
[root@master1 ~]# systemctl status ascend-dkms
Unit ascend-dkms.service could not be found.
[root@master1 ~]# systemctl status npu-smi
Unit npu-smi.service could not be found.
[root@master1 ~]# find / -name dmp_daemon 2>/dev/null
[root@master1 ~]# find / -name slogd 2>/dev/null
[root@master1 ~]# ls -l /var/dmp_daemon /var/slogd 2>/dev/null
[root@master1 ~]#

DCMI issue; hardware-level troubleshooting is needed.
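The `/dev` listing from the pod shows `davinci0`, `davinci_manager`, `devmm_svm` and `hisi_hdc`, yet the plugin log reports a card quantity of 0. To make that comparison repeatable during hardware triage, the per-card device nodes can be counted. This is a minimal sketch, assuming per-card nodes are named `davinci` followed by digits (as `davinci0` suggests); the function name `count_davinci_nodes` is hypothetical.

```shell
# Count per-card device nodes (davinci0, davinci1, ...) in a directory.
# Control nodes such as davinci_manager are excluded by the digit pattern.
count_davinci_nodes() {
    dir="${1:-/dev}"
    count=0
    for node in "$dir"/davinci[0-9]*; do
        # If nothing matches, the glob stays literal, so check existence
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running `count_davinci_nodes /dev` on the host and inside the container should give the same number. If `npu-smi` is installed on the host, the card list from `npu-smi info` can be compared against that count.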
...