openFuyao NPU-Operator Troubleshooting

Faulty pod: describe

[root@master1 ~]# kubectl -n kube-system describe pod ascend-device-plugin-ll46f
Name:                 ascend-device-plugin-ll46f
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      ascend-device-plugin-sa
Node:                 master1/10.17.30.131
Start Time:           Mon, 30 Mar 2026 11:08:32 +0800
Labels:               app.kubernetes.io/managed-by=npu-operator
                      controller-revision-hash=7df5dcb887
                      helm.sh/chart=npu-operator-0.15.0
                      name=ascend-device-plugin-ds
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: c1f2adcaeaaf2bdcf0a6e09730f68231a293074e31d58f61997f714dfb520878
                      cni.projectcalico.org/podIP: 192.168.137.118/32
                      cni.projectcalico.org/podIPs: 192.168.137.118/32
                      scheduler.alpha.kubernetes.io/critical-pod:
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   192.168.137.118
IPs:
  IP:           192.168.137.118
Controlled By:  DaemonSet/ascend-device-plugin
Init Containers:
  init-permission:
    Container ID:  containerd://4406968a522bea48dfefebae81ec53644312762af4781c25de689952ed6c2d27
    Image:         cr.openfuyao.cn/openfuyao/busybox:1.36.1
    Image ID:      cr.openfuyao.cn/openfuyao/busybox@sha256:4b8407fadd8100c61b097d63efe992b2c033e7d371c9117f7a9462fe87e31176
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      chown 9000:9000 /var/log/mindx-dl /var/log/mindx-dl/devicePlugin
      chmod 750 /var/log/mindx-dl/devicePlugin
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 30 Mar 2026 15:28:32 +0800
      Finished:     Mon, 30 Mar 2026 15:28:32 +0800
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Containers:
  device-plugin-01:
    Container ID:  containerd://fcc0c4742285847e2621a9a9217502307fc7e28644fbf86b32f9c11d67a2c0ab
    Image:         cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0
    Image ID:      cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin@sha256:a5b9612b21bcd35384f9f19a05b2d7915b865e7b2be6a30bfd7806a9b8a86f58
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      device-plugin -useAscendDocker=true -volcanoType=false -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 31 Mar 2026 10:28:58 +0800
      Finished:     Tue, 31 Mar 2026 10:28:58 +0800
    Ready:          False
    Restart Count:  274
    Limits:
      cpu:     500m
      memory:  500Mi
    Requests:
      cpu:     500m
      memory:  500Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /tmp from tmp (rw)
      /usr/local/Ascend/driver from hiai-driver (ro)
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/lib/kubelet/pod-resources from pod-resource (rw)
      /var/log/mindx-dl/devicePlugin from log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gfldg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  pod-resource:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pod-resources
    HostPathType:
  hiai-driver:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver
    HostPathType:
  log-path:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/mindx-dl/devicePlugin
    HostPathType:  DirectoryOrCreate
  tmp:
    Type:          HostPath (bare host directory volume)
    Path:          /tmp
    HostPathType:
  kube-api-access-gfldg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              openfuyao.com/npu.present=
Tolerations:                 CriticalAddonsOnly op=Exists
                             device-plugin=v2:NoSchedule
                             huawei.com/Ascend910:NoSchedule op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   16m (x205 over 18h)     kubelet  (combined from similar events): Successfully pulled image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0" in 403ms (403ms including waiting). Image size: 48017174 bytes.
  Warning  BackOff  2m47s (x5216 over 18h)  kubelet  Back-off restarting failed container device-plugin-01 in pod ascend-device-plugin-ll46f_kube-system(8edcd384-ab2d-4998-8077-5ac58801c79e)
  Normal   Pulling  66s (x227 over 19h)     kubelet  Pulling image "cr.openfuyao.cn/openfuyao/ascend-image/ascend-k8sdeviceplugin:v6.0.0"

Faulty pod: /dev check

[root@master1 fuyao-26.3-rc3]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- ls /dev
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
autofs           null               tty10  tty34  tty58    vcs5
bsg              ppp                tty11  tty35  tty59    vcs6
btrfs-control    ptmx               tty12  tty36  tty6     vcsa
bus              pts                tty13  tty37  tty60    vcsa1
core             random             tty14  tty38  tty61    vcsa2
cpu_dma_latency  raw                tty15  tty39  tty62    vcsa3
cuse             relationship_ctrl  tty16  tty4   tty63    vcsa4
davinci0         rfkill             tty17  tty40  tty7     vcsa5
davinci_manager  rtc0               tty18  tty41  tty8     vcsa6
devmm_svm        sda                tty19  tty42  tty9     vcsu
dri              sda1               tty2   tty43  ttyAMA0  vcsu1
fb0              sda2               tty20  tty44  ttyS0    vcsu2
fd               sg0                tty21  tty45  ttyS1    vcsu3
full             sg1                tty22  tty46  ttyS2    vcsu4
fuse             sg2                tty23  tty47  ttyS3    vcsu5
hidraw0          shm                tty24  tty48  uhid     vcsu6
hidraw1          snapshot           tty25  tty49  uinput   vfio
hisi_hdc         sr0                tty26  tty5   urandom  vga_arbiter
hwrng            sr1                tty27  tty50  usbmon0  vhost-net
input            stderr             tty28  tty51  usbmon1  vhost-vsock
kmsg             stdin              tty29  tty52  usbmon2  vport2p1
loop-control     stdout             tty3   tty53  vcs      zero
mapper           termination-log    tty30  tty54  vcs1
mem              tty                tty31  tty55  vcs2
mqueue           tty0               tty32  tty56  vcs3
net              tty1               tty33  tty57  vcs4

Faulty pod: driver check

[root@master1 fuyao-26.3-rc3]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- ls -lha /usr/local/Ascend/driver
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
total 44K
drwxr-xr-x  8 root root 4.0K Mar 27 08:03 .
drwxr-xr-x  3 root root 4.0K Mar 31 02:34 ..
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 bin
-r--r--r--  1 root root   20 Mar 27 08:01 build.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 device
dr-x------ 41 root root 4.0K Mar 27 08:01 kernel
drwxr-xr-x  6 root root 4.0K Mar 27 08:01 lib64
-r--r-----  1 root root   56 Mar 27 08:01 scene.info
dr-xr-x---  2 root root 4.0K Mar 27 08:01 script
drwxr-xr-x  2 root root 4.0K Mar 27 08:01 tools
-r--r--r--  1 root root  352 Mar 27 08:03 version.info

Faulty pod: logs

[root@master1 ~]# kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
[INFO]     2026/03/31 06:46:54.593254 1       hwlog/api.go:108    devicePlugin.log's logger init success
[INFO]     2026/03/31 06:46:54.593449 1       main.go:187    ascend device plugin starting and the version is v6.0.0_linux-aarch64
[INFO]     2026/03/31 06:46:54.593494 1       main.go:188    ascend device plugin starting scene is center
[INFO]     2026/03/31 06:46:54.787930 1       devmanager/devmanager.go:104    the dcmi version is 24.1.rc3
[ERROR]    2026/03/31 06:46:54.788019 1       devmanager/devmanager.go:211    get error card quantity: 0
[ERROR]    2026/03/31 06:46:54.788052 1       devmanager/devmanager.go:195    get card list failed for init
[ERROR]    2026/03/31 06:46:54.788101 1       main.go:203    init devmanager failed, err: auto init failed, err: get card list failed for init

Faulty pod: driver check

[root@master1 ~]# kubectl -n kube-system exec -it daemonsets/ascend-device-plugin -- bash -c 'find /usr/local/Ascend/driver -name libdcmi.so 2>/dev/null; echo $LD_LIBRARY_PATH'
Defaulted container "device-plugin-01" out of: device-plugin-01, init-permission (init)
/usr/local/Ascend/driver/lib64/driver/libdcmi.so
command terminated with exit code 137
[root@master1 ~]# ps -ef | grep -E 'dmp_daemon|slogd' | grep -v grep
root       21578       1  0 Mar30 ?        00:00:19 /usr/sbin/rsyslogd -n -i/var/run/rsyslogd.pid

Note: the only match is rsyslogd, whose name contains "slogd"; no dmp_daemon or slogd process is running.

Check service status?

[root@master1 ~]# systemctl status ascend-dmi
Unit ascend-dmi.service could not be found.
[root@master1 ~]# systemctl status ascend-dkms
Unit ascend-dkms.service could not be found.
[root@master1 ~]# systemctl status npu-smi
Unit npu-smi.service could not be found.
[root@master1 ~]# find / -name dmp_daemon 2>/dev/null
[root@master1 ~]# find / -name slogd 2>/dev/null
[root@master1 ~]# ls -l /var/dmp_daemon /var/slogd 2>/dev/null
[root@master1 ~]#

DCMI issue; hardware-level troubleshooting is required ...
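
The manual checks above could be condensed into two small shell helpers for repeat runs. This is a minimal sketch, not part of NPU-Operator: the function names `missing_npu_nodes` and `classify_plugin_log` are hypothetical, the device node names are simply those visible in the /dev listing above, and the matched string is the error line from the device-plugin log above.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers.

# Print every expected Ascend device node that is missing under a directory
# (default: /dev). Node names are taken from the /dev listing in these notes.
missing_npu_nodes() {
    mnn_dir="${1:-/dev}"
    for mnn_node in davinci_manager devmm_svm hisi_hdc; do
        [ -e "${mnn_dir}/${mnn_node}" ] || echo "${mnn_node}"
    done
    # At least one per-card node (davinci0, davinci1, ...) should exist.
    mnn_found=0
    for mnn_path in "${mnn_dir}"/davinci[0-9]*; do
        [ -e "${mnn_path}" ] && mnn_found=1
    done
    [ "${mnn_found}" -eq 1 ] || echo "davinciN"
    return 0
}

# Read device-plugin log text on stdin and report whether it contains the
# zero-card error seen in the log in these notes.
classify_plugin_log() {
    if grep -q 'get error card quantity: 0'; then
        echo "dcmi-reports-zero-cards"
    else
        echo "no-zero-card-error"
    fi
}
```

After sourcing the helpers, the log command used above can be piped straight in: `kubectl -n kube-system logs daemonsets/ascend-device-plugin --previous | classify_plugin_log`. If `missing_npu_nodes` prints nothing while the log still reports zero cards, the situation matches the one recorded here: the device nodes are visible, yet DCMI finds no card.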