InferNex (AI Inference Integrated Deployment) is an end-to-end deployment solution optimized for AI inference services in cloud-native environments. Built on the Kubernetes Gateway API Inference Extension (GIE) and a mainstream LLM stack, it uses a Helm chart to integrate the core acceleration modules — an open-source gateway, intelligent routing, high-performance inference backends, global KVCache management, an autoscaling decision framework, and inference observability — into one package. It covers the full path from request ingress through dynamic routing and inference execution to resource management and monitoring, aiming to raise inference throughput, reduce TTFT/TPOT latency, and deliver a one-stop AI serving experience.

Related documentation:

The project has only been officially validated on the Ascend 910, and all I have is a single 310P card. In theory it can run, but a series of modifications are needed. This post records the problems hit during deployment and how each was resolved.

After deployment, several pods never come up:

NAMESPACE                     NAME                                                          READY   STATUS      RESTARTS       AGE  
ai-inference                  vllm-pd-2p1d-01-decode-54cc4c7579-5h62w                       0/1     Pending     0              5d18h  
ai-inference                  vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg                       0/1     Init:0/2    0              5d  
ai-inference                  vllm-pd-2p1d-01-prefill-5c546dbcc-thmkd                       0/1     Pending     0              5d18h  
ai-inference                  vllm-pd-2p1d-01-prefill-fd68f87cf-jjdlc                       0/1     Pending     0              5d  

hccn_tool missing

The hostPath mount for hccn_tool appears to fail because the file does not exist on the host (see the FailedMount event at the end of the describe output):

[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg 
Name:             vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg
Namespace:        ai-inference
Priority:         0
Service Account:  default
Node:             master1/10.17.30.131
Start Time:       Tue, 07 Apr 2026 09:31:33 +0800
Labels:           app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
                  app.kubernetes.io/name=inference-backend
                  openfuyao.com/dpSize=1
                  openfuyao.com/engine=vllm
                  openfuyao.com/model=qwen-qwen3-8b
                  openfuyao.com/pdGroupID=qwen3-8b-pd-01
                  openfuyao.com/pdRole=decode
                  openfuyao.com/ppSize=1
                  openfuyao.com/tpSize=1
                  pod-template-hash=6cd64bc69c
Annotations:      checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
                  huawei.com/AscendReal: Ascend310P-0
                  huawei.com/kltDev: Ascend310P-0
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
  mooncake-config-init:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      set -e
      CONFIG_PATH="/app/mooncake.json"
      mkdir -p "$(dirname "$CONFIG_PATH")"
      cat > /tmp/mooncake_config.tpl << 'EOF'
        local_hostname: "$POD_IP"
        metadata_server: "redis://redis-service:6379"
        master_server_address: "mooncake-master-service:30089"
        device_name: ""
        protocol: "ascend"
        global_segment_size: 42949672960
        use_ascend_direct: true
        
      EOF
      POD_IP_VALUE="${POD_IP:-0.0.0.0}"
      sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      POD_NAME:  vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
      POD_IP:     (v1:status.podIP)
    Mounts:
      /app from mooncake-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
  huggingface-download:
    Container ID:  
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      hf
      download
      Qwen/Qwen3-8B
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
    Mounts:
      /root/.cache from rootcache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Containers:
  decode-engine:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
    Image ID:      
    Port:          8000/TCP (decode-port)
    Host Port:     0/TCP (decode-port)
    Command:
      /bin/bash
      -c
    Args:
      # PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
      export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
      
      # start vllm service
      vllm serve Qwen/Qwen3-8B \
        --served-model-name Qwen/Qwen3-8B \
        --trust-remote-code \
        --no-enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 10000 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                    8
      huawei.com/Ascend310P:  1
      memory:                 64Gi
    Requests:
      cpu:                    4
      huawei.com/Ascend310P:  1
      memory:                 32Gi
    Liveness:                 http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:                http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:                  http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
      POD_NAME:              vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
      POD_IP:                 (v1:status.podIP)
    Mounts:
      /app from mooncake-config (ro)
      /dev/shm from shm (rw)
      /etc/ascend_install.info from installinfo (rw)
      /etc/hccn.conf from hccnconf (rw)
      /root/.cache from rootcache (rw)
      /usr/bin/hccn_tool from hccntool (rw)
      /usr/local/Ascend/driver/lib64 from lib64 (rw)
      /usr/local/Ascend/driver/version.info from version (rw)
      /usr/local/bin/npu-smi from npusmi (rw)
      /usr/local/dcmi from dcmi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mooncake-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  24Gi
  dcmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/dcmi
    HostPathType:  
  npusmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin/npu-smi
    HostPathType:  File
  lib64:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/lib64
    HostPathType:  
  version:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/version.info
    HostPathType:  File
  installinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ascend_install.info
    HostPathType:  File
  hccntool:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/hccn_tool
    HostPathType:  File
  hccnconf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hccn.conf
    HostPathType:  File
  rootcache:
    Type:          HostPath (bare host directory volume)
    Path:          /home/llm_cache
    HostPathType:  
  kube-api-access-jswfw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  5d                      default-scheduler  0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  5d (x2 over 5d)         default-scheduler  0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  9m32s                   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  9m20s (x24 over 9m29s)  default-scheduler  0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         8m58s                   default-scheduler  Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg to master1
  Warning  FailedMount       44s (x12 over 8m58s)    kubelet            MountVolume.SetUp failed for volume "hccntool" : hostPath type check failed: /usr/bin/hccn_tool is not a file

Temporary workaround: create an executable stub so the kubelet's hostPath type check (HostPathType: File) passes. On a single 310P card, which has no inter-chip HCCN networking to configure, the engine shouldn't actually need to invoke hccn_tool:

touch /usr/bin/hccn_tool
chmod +x /usr/bin/hccn_tool
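An alternative, if you would rather not plant stub files on the host, is to relax the hostPath type on the hccntool volume in the Deployment spec so the kubelet skips the existence check. This is a sketch against the volume definition shown in the describe output above; note that with an empty type, if the path is missing the runtime will typically create an empty directory there instead:

```yaml
# Fragment of the Deployment pod spec (volume names from the describe output).
volumes:
  - name: hccntool
    hostPath:
      path: /usr/bin/hccn_tool
      type: ""   # empty type: no "is a File" check before mounting
```

Whether the InferNex chart exposes this via values is unknown to me; if it does, overriding it there is cleaner than editing the rendered Deployment.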

huggingface-download fails

[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp 
Name:             vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp
Namespace:        ai-inference
Priority:         0
Service Account:  default
Node:             master1/10.17.30.131
Start Time:       Tue, 07 Apr 2026 09:45:19 +0800
Labels:           app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
                  app.kubernetes.io/name=inference-backend
                  openfuyao.com/dpSize=1
                  openfuyao.com/engine=vllm
                  openfuyao.com/model=qwen-qwen3-8b
                  openfuyao.com/pdGroupID=qwen3-8b-pd-01
                  openfuyao.com/pdRole=decode
                  openfuyao.com/ppSize=1
                  openfuyao.com/tpSize=1
                  pod-template-hash=6cd64bc69c
Annotations:      checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
                  cni.projectcalico.org/containerID: 32b3384131b69054ca45acc8afe5e272b0ca681ea6d0611b3fec7316e3532e80
                  cni.projectcalico.org/podIP: 192.168.137.155/32
                  cni.projectcalico.org/podIPs: 192.168.137.155/32
                  huawei.com/AscendReal: Ascend310P-0
                  huawei.com/kltDev: Ascend310P-0
Status:           Pending
IP:               192.168.137.155
IPs:
  IP:           192.168.137.155
Controlled By:  ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
  mooncake-config-init:
    Container ID:  containerd://4ede488dd17e33f3a980aee6fa4eac3093ad6366c3854fe85436f81f6e1df7bb
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image ID:      hub.oepkgs.net/openfuyao/mikefarah/yq@sha256:4facc66fdcc785ec961ef7f2185f53f862f462eefe1d50c2eb311c2bb26823e3
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      set -e
      CONFIG_PATH="/app/mooncake.json"
      mkdir -p "$(dirname "$CONFIG_PATH")"
      cat > /tmp/mooncake_config.tpl << 'EOF'
        local_hostname: "$POD_IP"
        metadata_server: "redis://redis-service:6379"
        master_server_address: "mooncake-master-service:30089"
        device_name: ""
        protocol: "ascend"
        global_segment_size: 42949672960
        use_ascend_direct: true
        
      EOF
      POD_IP_VALUE="${POD_IP:-0.0.0.0}"
      sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 07 Apr 2026 09:45:20 +0800
      Finished:     Tue, 07 Apr 2026 09:45:20 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:  vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
      POD_IP:     (v1:status.podIP)
    Mounts:
      /app from mooncake-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
  huggingface-download:
    Container ID:  containerd://84e15b7d9e2f7382181e309c1558174ec48e58ad0ae14f92ae0dfff284da76e5
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image ID:      cr.openfuyao.cn/openfuyao/huggingface-download@sha256:ac86348b5e6934a020c21c4f0ebf81b520194ba8e549f1847ecc7521b82d9a8d
    Port:          <none>
    Host Port:     <none>
    Command:
      hf
      download
      Qwen/Qwen3-8B
    State:          Running
      Started:      Tue, 07 Apr 2026 09:47:33 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 07 Apr 2026 09:45:20 +0800
      Finished:     Tue, 07 Apr 2026 09:47:32 +0800
    Ready:          False
    Restart Count:  1
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
    Mounts:
      /root/.cache from rootcache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Containers:
  decode-engine:
    Container ID:  
    Image:         hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
    Image ID:      
    Port:          8000/TCP (decode-port)
    Host Port:     0/TCP (decode-port)
    Command:
      /bin/bash
      -c
    Args:
      # PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
      export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
      
      # start vllm service
      vllm serve Qwen/Qwen3-8B \
        --served-model-name Qwen/Qwen3-8B \
        --trust-remote-code \
        --no-enable-prefix-caching \
        --port 8000 \
        --tensor-parallel-size 1 \
        --max-model-len 10000 \
        --max-num-batched-tokens 40960 \
        --data-parallel-size 1 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
      
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                    8
      huawei.com/Ascend310P:  1
      memory:                 64Gi
    Requests:
      cpu:                    4
      huawei.com/Ascend310P:  1
      memory:                 32Gi
    Liveness:                 http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:                http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
    Startup:                  http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
    Environment:
      HF_HUB_OFFLINE:        0
      VLLM_USE_V1:           1
      GLOO_SOCKET_IFNAME:    eth0
      TP_SOCKET_IFNAME:      eth0
      HCCL_SOCKET_IFNAME:    eth0
      MOONCAKE_CONFIG_PATH:  /app/mooncake.json
      POD_NAME:              vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
      POD_IP:                 (v1:status.podIP)
    Mounts:
      /app from mooncake-config (ro)
      /dev/shm from shm (rw)
      /etc/ascend_install.info from installinfo (rw)
      /etc/hccn.conf from hccnconf (rw)
      /root/.cache from rootcache (rw)
      /usr/bin/hccn_tool from hccntool (rw)
      /usr/local/Ascend/driver/lib64 from lib64 (rw)
      /usr/local/Ascend/driver/version.info from version (rw)
      /usr/local/bin/npu-smi from npusmi (rw)
      /usr/local/dcmi from dcmi (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  mooncake-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  24Gi
  dcmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/dcmi
    HostPathType:  
  npusmi:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/bin/npu-smi
    HostPathType:  File
  lib64:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/lib64
    HostPathType:  
  version:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/Ascend/driver/version.info
    HostPathType:  File
  installinfo:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ascend_install.info
    HostPathType:  File
  hccntool:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/bin/hccn_tool
    HostPathType:  File
  hccnconf:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hccn.conf
    HostPathType:  File
  rootcache:
    Type:          HostPath (bare host directory volume)
    Path:          /home/llm_cache
    HostPathType:  
  kube-api-access-9jvbd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  3m20s               default-scheduler  0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Normal   Scheduled         2m15s               default-scheduler  Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp to master1
  Normal   Pulled            2m14s               kubelet            Container image "hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1" already present on machine
  Normal   Created           2m14s               kubelet            Created container: mooncake-config-init
  Normal   Started           2m14s               kubelet            Started container mooncake-config-init
  Normal   Pulled            1s (x2 over 2m14s)  kubelet            Container image "cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2" already present on machine
  Normal   Created           1s (x2 over 2m14s)  kubelet            Created container: huggingface-download
  Normal   Started           1s (x2 over 2m14s)  kubelet            Started container huggingface-download

Check the error logs:

# logs from the current attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download

# logs from the previous, failed attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous

[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download
[root@master1 ~]# 
[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
    raise exc
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 78, in handle_request
    stream = self._connect(request)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 124, in _connect
    stream = self._network_backend.connect_tcp(**kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/sync.py", line 207, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 240, in snapshot_download
    repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3285, in repo_info
    return method(
           ^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3020, in model_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1053, in get
    return self.request(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1014, in _send_single_request
    response = transport.handle_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in handle_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/hf", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/hf.py", line 113, in main
    app()
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1152, in __call__
    raise e
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1135, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 795, in main
    return _main(
           ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 188, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1514, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 224, in download
    _print_result(run_download())
                  ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 185, in run_download
    return snapshot_download(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 324, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Got: ConnectError: [Errno 101] Network is unreachable
An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

This confirms a network problem: the node cannot reach Hugging Face (Network is unreachable), and there is no local cache to fall back on.
Solution: switch to a domestic mirror (recommended).
In the Deployment, add an environment variable to the huggingface-download init container:

env:
  - name: HF_ENDPOINT
    value: "https://hf-mirror.com"

Add it to the decode-engine container as well, otherwise it fails with the same error.
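Using the container names from the describe output above (huggingface-download init container, decode-engine main container), the change can be applied as a strategic merge patch; both container lists and env lists merge by name, so existing entries are preserved. A sketch:

```yaml
# hf-endpoint-patch.yaml -- apply with:
#   kubectl -n ai-inference patch deployment vllm-pd-2p1d-01-decode \
#     --patch-file hf-endpoint-patch.yaml
# (repeat for the prefill deployment)
spec:
  template:
    spec:
      initContainers:
        - name: huggingface-download
          env:
            - name: HF_ENDPOINT
              value: "https://hf-mirror.com"
      containers:
        - name: decode-engine
          env:
            - name: HF_ENDPOINT
              value: "https://hf-mirror.com"
```

If the chart exposes the env list through values, setting it there and upgrading the release is the more durable fix, since a manual patch is lost on the next helm upgrade.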

Tailing the download logs may show no output at all:

kubectl -n ai-inference logs deployments/vllm-pd-2p1d-01-decode -c huggingface-download -f

In that case, just watch the size of the model cache directory on the host (the container's /root/.cache is the rootcache hostPath, /home/llm_cache); you can see it steadily growing:

$ watch -n 2 -d 'du -sh /home/llm_cache/'
426M    /home/llm_cache/

Runtime error on the 310P

[root@master1 ~]# kubectl  -n ai-inference logs  vllm-pd-2p1d-01-decode-7d487c49cd-qw89v
Defaulted container "decode-engine" out of: decode-engine, mooncake-config-init (init), huggingface-download (init)
...
INFO 04-07 05:15:19 [__init__.py:217] Platform plugin ascend is activated
(EngineCore_DP0 pid=94) INFO 04-07 05:15:33 [ascend_config.py:55] Linear layer sharding enabled with config: None. Note: This feature works optimally with FLASHCOMM2 and DSA-CP enabled; using it without these features may result in significant performance degradation.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] EngineCore failed to start.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] Traceback (most recent call last):
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     super().__init__(
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self._init_executor()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]                   ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     check_ascend_device_type()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]   File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]     assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.
(EngineCore_DP0 pid=94) Process EngineCore_DP0:
(EngineCore_DP0 pid=94) Traceback (most recent call last):
(EngineCore_DP0 pid=94)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=94)     self.run()
(EngineCore_DP0 pid=94)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=94)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 72, in run_engine_core
(EngineCore_DP0 pid=94)     raise e
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94)     super().__init__(
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94)     self._init_executor()
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94)     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94)     self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94)                   ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94)     check_ascend_device_type()
(EngineCore_DP0 pid=94)   File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94)     assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.

Investigation showed that the hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 image is built for the 910; the official documentation confirms this.

For the 310P you should use an image tagged with a `310p` suffix. After searching the image registry, I switched to quay.io/ascend/vllm-ascend:v0.18.0rc1-310p-openeuler to try.

Comparing sha256 digests shows that hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 is an exact copy of quay.io/ascend/vllm-ascend:v0.13.0: the digests are identical.
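The digest comparison can be reproduced without pulling either image. This is a sketch assuming `skopeo` is installed; the manifest digest is simply the sha256 of the raw manifest bytes:

```shell
# Compare the manifest digests of the two tags without pulling them.
# Identical digests mean both registries serve the exact same image.
command -v skopeo >/dev/null 2>&1 || exit 0   # skip if skopeo is absent
for img in \
  hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 \
  quay.io/ascend/vllm-ascend:v0.13.0
do
  # sha256 of the raw manifest == the registry's manifest digest
  printf '%s  %s\n' "$(skopeo inspect --raw "docker://$img" | sha256sum | cut -d' ' -f1)" "$img"
done | tee digests.txt
```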

310P vllm-ascend error

However, after switching to an image with the 310p suffix, the pod fails on startup:

Every 1.0s: kubectl -n ai-inference get pod                                                            master1: Thu Apr  9 08:36:06 2026

NAME                                          READY   STATUS             RESTARTS        AGE
cache-indexer-deployment-65d5b449f6-x9l46     1/1     Running            0               17h
inference-gateway-istio-5f9b7d78f6-7kbrw      1/1     Running            26 (15h ago)    17h
infernex-epp-5cc456bd-4vvmv                   1/1     Running            0               17h
mooncake-master-deployment-74cc5666b7-fr4fq   1/1     Running            0               17h
redis-server-deployment-67566b9765-m66lc      1/1     Running            0               17h
vllm-pd-2p1d-01-decode-7687ccb7b-vg98n        0/1     CrashLoopBackOff   161 (31s ago)   16h
vllm-pd-2p1d-01-prefill-66f7564d7f-tdd84      0/1     Pending            0               43h
vllm-pd-2p1d-01-prefill-fd68f87cf-mhtdk       0/1     Pending            0               40h
vllm-pd-2p1d-01-proxy-7ff4f59865-h8xbw        1/1     Running            0               17h




────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(APIServer pid=1)   File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/ascend_multi_connector.py", line 5, in <module>
(APIServer pid=1)     from vllm_ascend.distributed.kv_transfer.kv_p2p.mooncake_layerwise_connector import MooncakeLayerwiseConnector
(APIServer pid=1)   File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py", line 25, in <module>
(APIServer pid=1)     from mooncake.engine import TransferEngine  # type: ignore
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ModuleNotFoundError: No module named 'mooncake'
(APIServer pid=1) [ERROR] 2026-04-08-08:58:03 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
[root@master1 ~]# kubectl  -n ai-inference describe pod vllm-pd-2p1d-01-decode-7687ccb7b-vg98n | grep -i image:
    Image:         hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
    Image:         cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
    Image:         quay.io/ascend/vllm-ascend:main-310p

openfuyao's suggested resolution for this issue was as follows:

@tl.s Troubleshooting the InferNex deployment on 310P:

The vllm-ascend 310P image does not include mooncake, so KV cache transfer between prefill and decode is not supported:
https://github.com/vllm-project/vllm-ascend/blob/main/Dockerfile.310p

We recommend deploying in aggregated mode; see the InferNex aggregated-mode example. When deploying, delete the `inference-backend.services[0].kvTransferConfig` setting so the mooncake-related features are never used:
https://gitcode.com/openFuyao/InferNex/blob/0.22.2/examples/vllm-aggregated-random-values.yaml

vllm-ascend documentation for online inference on the 310P:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu

Inference on the 310P has not been verified yet; you can try v0.13.0 or v0.18.0rc1, both of which are covered by official vLLM documentation.
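The suggested deletion can be scripted against a local copy of the values file with yq (the chart already ships a mikefarah/yq container). The sample fragment below is only illustrative; the key path follows the quoted advice:

```shell
command -v yq >/dev/null 2>&1 || exit 0   # skip if yq is absent
# Illustrative values fragment with the mooncake section present.
cat > values-310p.yaml <<'EOF'
inference-backend:
  services:
    - name: vllm
      kvTransferConfig:
        kvConnector: MooncakeConnector
EOF
# Drop the kvTransferConfig block so mooncake is never loaded.
yq -i 'del(."inference-backend".services[0].kvTransferConfig)' values-310p.yaml
cat values-310p.yaml
```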

Wrong card count in aggregated mode

The environment has only one 310P card, but the chart requests two by default. Two places need changing: the resource request count (set to 1), and the vLLM start argument tensor_parallel_size (set to 1).

        resources:
          limits:
            cpu: "8"
            huawei.com/Ascend310P: "1"
            memory: 64Gi
          requests:
            cpu: "4"
            huawei.com/Ascend310P: "1"
            memory: 32Gi
...
          # start vllm service
          vllm serve Qwen/Qwen3-8B \
            --served-model-name Qwen/Qwen3-8B \
            --trust-remote-code \
            --enable-prefix-caching \
            --port 8000 \
            --tensor-parallel-size 1 \
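Before editing, it can help to confirm how many cards the node actually exposes; both the resource limit and `--tensor-parallel-size` must not exceed that number. A sketch, assuming access to the control node from the environment above:

```shell
# Skip when not run against the cluster control node.
{ command -v kubectl >/dev/null 2>&1 && kubectl get node master1 >/dev/null 2>&1; } || exit 0
# Allocatable Ascend 310P devices as seen by the scheduler.
kubectl describe node master1 | grep -i 'Ascend310P' || true
# Physical cards reported by the Ascend driver on the host, if available.
command -v npu-smi >/dev/null 2>&1 && npu-smi info || true
```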

bf16 dtype error

(EngineCore pid=35) [PID: 35] 2026-04-09-08:27:33.217.113 AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT,DT_FLOAT16,DT_INT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT8,DT_BOOL,DT_DOUBLE,].
(EngineCore pid=35) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:27:49 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to an AI-generated explanation (not yet verified): the Ascend 310P does not support the bfloat16 (bf16) data type, but vLLM initializes the rotary embedding (RoPE) with torch.ones(..., dtype=torch.bfloat16), which triggers the ACL operator error.
The 310P operator library supports only float32 and float16 among floating-point types; bf16 is not included.

Fixed by forcing float16 with the --dtype half argument:

# --dtype half forces float16 instead of bfloat16
vllm serve Qwen/Qwen3-8B \
  --dtype half \
  ...other arguments
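Once the pod is Running, a quick smoke test confirms the fp16 deployment actually serves tokens. The deployment name and port below are assumptions based on the pod listing above; substitute whatever `kubectl get deploy -n ai-inference` shows:

```shell
command -v kubectl >/dev/null 2>&1 || exit 0   # skip outside the cluster
# Forward the proxy port locally (deployment name is an assumption).
kubectl -n ai-inference port-forward deploy/vllm-pd-2p1d-01-proxy 8000:8000 >/dev/null 2>&1 &
PF_PID=$!
sleep 2
# Ask for a few tokens through the OpenAI-compatible completions endpoint.
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-8B", "prompt": "hello", "max_tokens": 8}'
kill "$PF_PID" 2>/dev/null || true
```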

npu_dynamic_quant operator error

With the issues above fixed, the pod survives longer:

(EngineCore pid=35) INFO 04-09 08:43:11 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen3-8B: 1.077223 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:08<00:32,  8.07s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:16<00:24,  8.33s/it]

But it still ends in an error; the traceback:

(EngineCore pid=35) INFO 04-09 08:47:55 [default_loader.py:384] Loading weights took 35.12 seconds
(EngineCore pid=35) INFO 04-09 08:47:57 [model_runner_v1.py:2589] Loading model weights took 17.6043 GB
.(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/4de24ceb58/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:1048] Dynamo bytecode transform time: 13.07 s
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     super().__init__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     super().profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return super()._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._call_with_optional_nvtx_range(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return callable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     raise BackendCompilerFailed(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.configure_post_pass()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return make_fx_tracer.trace(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._trace_inner(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     t = dispatch_trace(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         ^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return disable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     (self.create_arg(fn(*args)),),
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                      ^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]           ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     out = func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]           ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         TraceBack (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Tiling failed
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Tiling Failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]         DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] 
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35)     self.run()
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35)     self._target(*self._args, **self._kwargs)
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35)     raise e
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35)     super().__init__(
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35)     self.model_runner.profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35)     super().profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35)     return super()._dummy_run(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35)     outputs = self._model_forward(
(EngineCore pid=35)               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35)     output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35)     return self._call_with_optional_nvtx_range(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35)     return callable_fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35)     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35)     raise BackendCompilerFailed(
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35)     compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35)     compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35)     return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35)     return func(*args, **kwds)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35)     self.configure_post_pass()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35)     self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35)     self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35)     AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35)     pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35)     pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35)     return func(*args, **kwds)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35)     search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35)     gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35)     return make_fx_tracer.trace(f, *args)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35)     return self._trace_inner(f, *args)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35)     t = dispatch_trace(
(EngineCore pid=35)         ^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35)     return disable_fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35)     graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
(EngineCore pid=35)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35)     (self.create_arg(fn(*args)),),
(EngineCore pid=35)                      ^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35)     out = f(*tensors)  # type:ignore[call-arg]
(EngineCore pid=35)           ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35)     quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35)     return fn(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35)     return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35)     out = func(*args, **kwargs)
(EngineCore pid=35)           ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35)         TraceBack (most recent call last):
(EngineCore pid=35)         Tiling failed
(EngineCore pid=35)         Tiling Failed.
(EngineCore pid=35)         Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35)         DynamicQuant launch kernel failed.
(EngineCore pid=35) 
(EngineCore pid=35) 
(EngineCore pid=35) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) 
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:48:30 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to the AI's analysis, the root cause is that the norm_quant fusion pass dispatches the npu_dynamic_quant operator to the 310P during the compilation phase, but the 310P does not support this dynamic-quantization operator (or the current CANN version is incompatible with it), so tiling fails.
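
The stack trace bears this out: register_replacement traces the pattern function with make_fx(fn, ..., tracing_mode="real"), which executes the pattern body on real tensors, so the unsupported operator fires at pass-registration time, before any request is even served. A minimal CPU-only sketch of this trace-by-execution behavior, using torch.relu as a stand-in for npu_dynamic_quant:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

calls = []

def pattern(x):
    # make_fx traces by *running* this function; under tracing_mode="real"
    # the ops execute on actual tensors, which is why an operator the device
    # does not support (here stood in by relu) crashes during registration.
    calls.append("executed")
    return torch.relu(x)

gm = make_fx(pattern, tracing_mode="real")(torch.randn(4))
print(len(calls))  # the pattern body ran once during tracing
```

This would also explain why --enforce-eager sidesteps the crash: with compilation disabled, the fusion pass manager never registers the pattern, so npu_dynamic_quant is never invoked.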

The openfuyao community responded as follows:

@tl.s
Looking at the error log, the 310 card does not support the DynamicQuantV2 operator that vllm-ascend enables by default. Try adding the launch options --enforce-eager and --no-quant.
In InferNex, vLLM launch parameters that are not directly exposed can be added under inference-backend.services[0].pd.prefill/decode.extraArgs. For example:
extraArgs:
    - "--enforce-eager"
    - "--no-quant"
If that still fails, try a different model and deploy following the examples in the official 310P documentation, e.g. Qwen2.5-7B-Instruct.
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu
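
Mapped into the Helm values, the suggestion would look roughly like the fragment below. The extraArgs path is taken from the quoted reply; the surrounding field names are a sketch and may differ in your chart version:

```yaml
inference-backend:
  services:
    - pd:
        prefill:
          extraArgs:
            - "--enforce-eager"
            - "--no-quant"
        decode:
          extraArgs:
            - "--enforce-eager"
            - "--no-quant"
```

Followed by a helm upgrade so the prefill/decode Deployments are re-rendered with the new arguments.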

OOM

[root@master1 fuyao-26.3-rc3]# kubectl  -n ai-inference logs deployments/vllm-pd-2p1d-01 -f 
Defaulted container "aggregated-engine" out of: aggregated-engine, huggingface-download (init)



INFO 04-13 06:21:29 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:21:29 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:21:29 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:21:29 [__init__.py:239] Platform plugin ascend is activated
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
WARNING 04-13 06:21:44 [__init__.py:80] The quantization method 'ascend' already exists and will be overwritten by the quantization config <class 'vllm_ascend._310p.quantization.modelslim_config.AscendModelSlimConfig310'>.
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] 
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-7B-Instruct
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] 
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-7B-Instruct', 'model': 'Qwen/Qwen2.5-7B-Instruct', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 4096, 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen2.5-7B-Instruct'], 'block_size': 128, 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': True, 'max_num_batched_tokens': 40960, 'kv_events_config': KVEventsConfig(enable_kv_cache_events=True, publisher='zmq', endpoint='tcp://*:5557', replay_endpoint=None, buffer_steps=10000, hwm=100000, max_queue_size=100000, topic='kv-events')}
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_ADDR
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT_HTTP_API
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_HOST
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_USE_V1
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) WARNING 04-13 06:22:18 [model.py:1920] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 04-13 06:22:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=40960.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:749] Parameter '--disable-cascade-attn' is a GPU-specific feature. Resetting to False for Ascend.
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:838] Ignored parameter 'disable_flashinfer_prefill'. This is a GPU-specific feature not supported on Ascend. Resetting to False.
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:425] Dynamic EPLB is False
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:426] The number of redundant experts is 0
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:297] Compilation disabled, using eager mode by default
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(APIServer pid=1) INFO 04-13 06:22:26 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
INFO 04-13 06:22:53 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:22:53 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:22:53 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:22:53 [__init__.py:239] Platform plugin ascend is activated
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
(EngineCore pid=35) INFO 04-13 06:23:03 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'vllm_ascend.compilation.compiler_interface.AscendCompiler', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [40960], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=35) WARNING 04-13 06:23:08 [camem.py:66] Failed to import vllm_ascend_C:/vllm-workspace/vllm-ascend/vllm_ascend/vllm_ascend_C.cpython-311-aarch64-linux-gnu.so: undefined symbol: _ZN9pp_matmul17GetPpMatmulTilingERKNS_10MatMulInfoERKNS_12HardwareInfoERjRNS_18PpMatmulTilingDataE. Sleep mode will be disabled. 
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:425] Dynamic EPLB is False
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:426] The number of redundant experts is 0
INFO 04-13 06:23:22 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:23:22 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:23:22 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:23:22 [__init__.py:239] Platform plugin ascend is activated
....(EngineCore pid=35) INFO 04-13 06:24:35 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.137.164:42089 backend=hccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) INFO 04-13 06:24:36 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) WARNING 04-13 06:24:36 [worker.py:306] Bind cpus failed in rank0: Can not get running npu info. Skip binding cpu.
(EngineCore pid=35) INFO 04-13 06:24:37 [model_runner_v1.py:2562] Starting to load model Qwen/Qwen2.5-7B-Instruct...
(EngineCore pid=35) INFO 04-13 06:24:56 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-7B-Instruct: 4.228922 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:10<00:30, 10.12s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:20<00:20, 10.32s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:30<00:10, 10.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00,  9.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00, 10.08s/it]
(EngineCore pid=35) 
(EngineCore pid=35) INFO 04-13 06:25:46 [default_loader.py:384] Loading weights took 40.53 seconds
(EngineCore pid=35) INFO 04-13 06:25:48 [model_runner_v1.py:2589] Loading model weights took 16.2391 GB
.(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     super().__init__(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     super().profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super()._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self.forward(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     hidden_states = self.mlp(hidden_states)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super().forward(input_)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]     return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35)     self.run()
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35)     self._target(*self._args, **self._kwargs)
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35)     raise e
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35)     super().__init__(
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35)     self.model_runner.profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35)     super().profile_run()
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35)                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35)     return super()._dummy_run(
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35)     return func(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35)     outputs = self._model_forward(
(EngineCore pid=35)               ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35)     hidden_states = self.model(
(EngineCore pid=35)                     ^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35)     return self.forward(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35)     hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35)     hidden_states = self.mlp(hidden_states)
(EngineCore pid=35)                     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35)     gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35)                  ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35)     return self._call_impl(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35)     return forward_call(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35)     return super().forward(input_)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35)     output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35)     return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35)     return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35)     return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35)     return self._op(*args, **kwargs)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35)   File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35)     return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1)     return runner.run(wrapper())
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)   File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-13-06:26:05 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

According to the official documentation, a single 310P can apparently only run the 0.6B model:

Run the following script to start the vLLM server on NPU (Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards)

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu

A few extra startup parameters are also required:

vllm serve Qwen/Qwen3-0.6B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager \
    --dtype float16
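
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (assuming the default port 8000 and local access to the pod) can confirm it actually serves requests; this is just a sanity-check sketch, not part of the official tutorial:

```shell
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "Hello", "max_tokens": 16}'
```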

Helm upgrade errors

Every helm upgrade goes through several rounds like the following before it finally succeeds:

[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: cannot patch "vllm-pd-2p1d-01" with kind Deployment: Deployment.apps "vllm-pd-2p1d-01" is invalid: spec.selector: Invalid value: {"matchLabels":{"app.kubernetes.io/instance":"infernex-vllm-pd-2p1d-01","app.kubernetes.io/name":"inference-backend","openfuyao.com/dpSize":"1","openfuyao.com/engine":"vllm","openfuyao.com/model":"qwen-qwen3-0.6b","openfuyao.com/pdRole":"aggregate","openfuyao.com/tpSize":"2"}}: field is immutable
[root@master1 fuyao-26.3-rc3]# kubectl -n ai-inference delete deployments.apps vllm-pd-2p1d-01 
deployment.apps "vllm-pd-2p1d-01" deleted from ai-inference namespace
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: post-upgrade hooks failed: warning: Hook post-upgrade infernex/charts/pd-orchestrator/charts/resourcescalinggroup/templates/webhook-wait-hook.yaml failed: 1 error occurred:
        * jobs.batch "infernex-resourcescalinggroup-wait-webhook" is forbidden: unable to create new content in namespace scaling-system because it is being terminated


[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 
Error: UPGRADE FAILED: failed to create resource: namespaces "scaling-system" not found
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values 


Release "infernex" has been upgraded. Happy Helming!
NAME: infernex
LAST DEPLOYED: Mon Apr 13 14:33:47 2026
NAMESPACE: ai-inference
STATUS: deployed
REVISION: 27
TEST SUITE: None
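
The three failures above (immutable `spec.selector`, a post-upgrade hook blocked by the terminating `scaling-system` namespace, then the namespace briefly missing) can be scripted instead of retried by hand. A sketch of the manual recovery dance, assuming the same release name, chart path and values file; not an official fix:

```shell
#!/usr/bin/env bash
# 1. matchLabels changed, and Deployment selectors are immutable, so the
#    old Deployment must be deleted before it can be re-created:
kubectl -n ai-inference delete deployment vllm-pd-2p1d-01 --ignore-not-found
# 2. the post-upgrade hook fails while scaling-system is still terminating,
#    so wait for the namespace to disappear:
while kubectl get namespace scaling-system >/dev/null 2>&1; do
  sleep 5
done
# 3. retry the upgrade until it goes through:
until helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values; do
  sleep 5
done
```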

istio HTTPRoute error

[root@master1 fuyao-26.3-rc3]# kubectl  -n ai-inference describe httproutes.gateway.networking.k8s.io qwen-qwen3-0.6b-httproute 
Name:         qwen-qwen3-0.6b-httproute
Namespace:    ai-inference
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=infernex-epp
              app.kubernetes.io/version=0.21.0
Annotations:  meta.helm.sh/release-name: infernex
              meta.helm.sh/release-namespace: ai-inference
API Version:  gateway.networking.k8s.io/v1
Kind:         HTTPRoute
Metadata:
  Creation Timestamp:  2026-04-13T06:31:19Z
  Generation:          1
  Resource Version:    3664436
  UID:                 21dca6e8-483a-4b03-8a78-82559c45a7e3
Spec:
  Parent Refs:
    Group:  gateway.networking.k8s.io
    Kind:   Gateway
    Name:   inference-gateway
  Rules:
    Backend Refs:
      Group:   inference.networking.k8s.io
      Kind:    InferencePool
      Name:    qwen-qwen3-0.6b
      Weight:  1
    Matches:
      Path:
        Type:   PathPrefix
        Value:  /
    Timeouts:
      Request:  300s
Status:
  Parents:
    Conditions:
      Last Transition Time:  2026-04-13T06:31:19Z
      Message:               Route was valid
      Observed Generation:   1
      Reason:                Accepted
      Status:                True
      Type:                  Accepted
      Last Transition Time:  2026-04-13T06:31:19Z
      Message:               InferencePool.Name invalid; the name of the InferencePool must be used, not the hostname.
      Observed Generation:   1
      Reason:                InvalidDestination
      Status:                False
      Type:                  ResolvedRefs
    Controller Name:         istio.io/gateway-controller
    Parent Ref:
      Group:  gateway.networking.k8s.io
      Kind:   Gateway
      Name:   inference-gateway
Events:       <none>
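
The `ResolvedRefs: False` condition means the istio controller cannot resolve the `InferencePool` backend even though the route references it by name. A first diagnostic step (assuming the GIE CRDs are installed under this API group) is to confirm the pool exists under exactly the name the route uses, and to check which group istio's CRDs actually expect; earlier GIE releases shipped the pool under `inference.networking.x-k8s.io` rather than `inference.networking.k8s.io`, and a mismatch there can produce this error:

```shell
kubectl -n ai-inference get inferencepools.inference.networking.k8s.io qwen-qwen3-0.6b -o yaml
kubectl get crd | grep -i inferencepool
```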

Summary

  • The helm chart requests 2 × 310P cards by default; this must be changed by hand:

    resources:
      limits:
        cpu: "8"
        huawei.com/Ascend310P: "1"
        memory: 64Gi
      requests:
        cpu: "4"
        huawei.com/Ascend310P: "1"
        memory: 32Gi
  • Hugging Face downloads need a domestic mirror configured by hand:

    - name: HF_ENDPOINT
      value: https://hf-mirror.com
  • The vLLM startup arguments need manual tuning:

    vllm serve Qwen/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --trust-remote-code \
    --enable-prefix-caching \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager \
    --dtype float16 \
    --max-num-batched-tokens 40960 \
    --data-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --block-size 128 \
    --kv-events-config '{"enable_kv_cache_events": true, "publisher":"zmq", "topic":"kv-events"}'
  • The default Service type is ClusterIP, exposed externally through the istio Gateway; this is still not working
  • The vLLM pod logs are flooded with lines like:
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
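
These lines are just Prometheus scraping `/metrics` through uvicorn's access log. When reading logs interactively, the noise can simply be filtered out; a minimal sketch:

```python
def drop_metrics_noise(log_lines):
    """Filter out uvicorn access-log lines produced by /metrics scrapes."""
    return [line for line in log_lines if '"GET /metrics HTTP/1.1"' not in line]

logs = [
    '(APIServer pid=1) INFO:     192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK',
    '(APIServer pid=1) INFO:     Application startup complete.',
]
print(drop_metrics_noise(logs))
# → only the startup line remains
```

In practice `kubectl logs ... | grep -v 'GET /metrics'` does the same; raising the access-log level at startup (e.g. `--uvicorn-log-level warning`, assuming the flag is available in this vLLM build) may suppress the lines at the source.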

Refs

Last modified: 2026-04-13