AI Inference Integrated Deployment (InferNex) is an end-to-end integrated deployment solution optimized for AI inference services in cloud-native environments. Built on the Kubernetes Gateway API Inference Extension (GIE) and the mainstream LLM stack, it uses a Helm Chart to seamlessly integrate the core acceleration modules: an open-source gateway, intelligent routing, high-performance inference backends, global KVCache management, a scaling decision framework, and inference observability. It covers the full acceleration path from request ingress, dynamic routing, and inference execution through to resource management and monitoring, aiming to raise inference throughput, reduce TTFT/TPOT latency, and provide a one-stop, efficient AI serving experience.
Related documentation:
- https://gitcode.com/openFuyao/sig-ai-inference/blob/main/docs/zh/ai_inference_infernex/user_guide/ai_inference_infernex.md#%E5%AE%89%E8%A3%85
- https://gitcode.com/openFuyao/InferNex
The project has officially been validated only on the 910, and all I have is a single 310P card. In theory it should run, but a series of changes is required; this post records the problems hit during deployment and how they were resolved.
After deployment, several pods never come up:
NAMESPACE NAME READY STATUS RESTARTS AGE
ai-inference vllm-pd-2p1d-01-decode-54cc4c7579-5h62w 0/1 Pending 0 5d18h
ai-inference vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg 0/1 Init:0/2 0 5d
ai-inference vllm-pd-2p1d-01-prefill-5c546dbcc-thmkd 0/1 Pending 0 5d18h
ai-inference vllm-pd-2p1d-01-prefill-fd68f87cf-jjdlc 0/1 Pending 0 5d

## The hccn problem
It seems hccn_tool cannot be found:
[root@master1 ~]# kubectl -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg
Name: vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg
Namespace: ai-inference
Priority: 0
Service Account: default
Node: master1/10.17.30.131
Start Time: Tue, 07 Apr 2026 09:31:33 +0800
Labels: app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
app.kubernetes.io/name=inference-backend
openfuyao.com/dpSize=1
openfuyao.com/engine=vllm
openfuyao.com/model=qwen-qwen3-8b
openfuyao.com/pdGroupID=qwen3-8b-pd-01
openfuyao.com/pdRole=decode
openfuyao.com/ppSize=1
openfuyao.com/tpSize=1
pod-template-hash=6cd64bc69c
Annotations: checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
huawei.com/AscendReal: Ascend310P-0
huawei.com/kltDev: Ascend310P-0
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
mooncake-config-init:
Container ID:
Image: hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
Args:
set -e
CONFIG_PATH="/app/mooncake.json"
mkdir -p "$(dirname "$CONFIG_PATH")"
cat > /tmp/mooncake_config.tpl << 'EOF'
local_hostname: "$POD_IP"
metadata_server: "redis://redis-service:6379"
master_server_address: "mooncake-master-service:30089"
device_name: ""
protocol: "ascend"
global_segment_size: 42949672960
use_ascend_direct: true
EOF
POD_IP_VALUE="${POD_IP:-0.0.0.0}"
sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
POD_NAME: vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
POD_IP: (v1:status.podIP)
Mounts:
/app from mooncake-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
huggingface-download:
Container ID:
Image: cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
Image ID:
Port: <none>
Host Port: <none>
Command:
hf
download
Qwen/Qwen3-8B
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
HF_HUB_OFFLINE: 0
VLLM_USE_V1: 1
GLOO_SOCKET_IFNAME: eth0
TP_SOCKET_IFNAME: eth0
HCCL_SOCKET_IFNAME: eth0
MOONCAKE_CONFIG_PATH: /app/mooncake.json
Mounts:
/root/.cache from rootcache (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Containers:
decode-engine:
Container ID:
Image: hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
Image ID:
Port: 8000/TCP (decode-port)
Host Port: 0/TCP (decode-port)
Command:
/bin/bash
-c
Args:
# PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
# start vllm service
vllm serve Qwen/Qwen3-8B \
--served-model-name Qwen/Qwen3-8B \
--trust-remote-code \
--no-enable-prefix-caching \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 10000 \
--max-num-batched-tokens 40960 \
--data-parallel-size 1 \
--pipeline-parallel-size 1 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 8
huawei.com/Ascend310P: 1
memory: 64Gi
Requests:
cpu: 4
huawei.com/Ascend310P: 1
memory: 32Gi
Liveness: http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
Readiness: http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
Startup: http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
Environment:
HF_HUB_OFFLINE: 0
VLLM_USE_V1: 1
GLOO_SOCKET_IFNAME: eth0
TP_SOCKET_IFNAME: eth0
HCCL_SOCKET_IFNAME: eth0
MOONCAKE_CONFIG_PATH: /app/mooncake.json
POD_NAME: vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg (v1:metadata.name)
POD_IP: (v1:status.podIP)
Mounts:
/app from mooncake-config (ro)
/dev/shm from shm (rw)
/etc/ascend_install.info from installinfo (rw)
/etc/hccn.conf from hccnconf (rw)
/root/.cache from rootcache (rw)
/usr/bin/hccn_tool from hccntool (rw)
/usr/local/Ascend/driver/lib64 from lib64 (rw)
/usr/local/Ascend/driver/version.info from version (rw)
/usr/local/bin/npu-smi from npusmi (rw)
/usr/local/dcmi from dcmi (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jswfw (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
mooncake-config:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
shm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 24Gi
dcmi:
Type: HostPath (bare host directory volume)
Path: /usr/local/dcmi
HostPathType:
npusmi:
Type: HostPath (bare host directory volume)
Path: /usr/local/bin/npu-smi
HostPathType: File
lib64:
Type: HostPath (bare host directory volume)
Path: /usr/local/Ascend/driver/lib64
HostPathType:
version:
Type: HostPath (bare host directory volume)
Path: /usr/local/Ascend/driver/version.info
HostPathType: File
installinfo:
Type: HostPath (bare host directory volume)
Path: /etc/ascend_install.info
HostPathType: File
hccntool:
Type: HostPath (bare host directory volume)
Path: /usr/bin/hccn_tool
HostPathType: File
hccnconf:
Type: HostPath (bare host directory volume)
Path: /etc/hccn.conf
HostPathType: File
rootcache:
Type: HostPath (bare host directory volume)
Path: /home/llm_cache
HostPathType:
kube-api-access-jswfw:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5d default-scheduler 0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Warning FailedScheduling 5d (x2 over 5d) default-scheduler 0/1 nodes are available: 1 Insufficient memory. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Warning FailedScheduling 9m32s default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint(s). no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Warning FailedScheduling 9m20s (x24 over 9m29s) default-scheduler 0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Normal Scheduled 8m58s default-scheduler Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-7slqg to master1
Warning FailedMount 44s (x12 over 8m58s) kubelet MountVolume.SetUp failed for volume "hccntool" : hostPath type check failed: /usr/bin/hccn_tool is not a file
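The FailedMount event is the hostPath type check failing: the chart mounts /usr/bin/hccn_tool (the NPU NIC configuration tool, typically present on 910 nodes) with `HostPathType: File`, and this 310P node does not have it. A slightly safer variant of the touch-based bypass below is a no-op stub script, so anything inside the pod that actually execs the tool exits cleanly. A sketch, using a temporary path so it can run anywhere (on the node the target is /usr/bin/hccn_tool); this assumes nothing on a single-card 310P node consumes hccn_tool output:

```shell
# Create an executable no-op stub; an empty file also passes the mount
# check, but a stub script exits 0 if something actually invokes it.
STUB="$(mktemp -d)/hccn_tool"      # on the node: STUB=/usr/bin/hccn_tool
printf '#!/bin/sh\n# no hccn_tool on this 310P node\nexit 0\n' > "$STUB"
chmod +x "$STUB"
"$STUB" && echo "stub ok"
```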
Temporary workaround:
touch /usr/bin/hccn_tool
chmod +x /usr/bin/hccn_tool

## huggingface-download failure
[root@master1 ~]# kubectl -n ai-inference describe pod vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp
Name: vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp
Namespace: ai-inference
Priority: 0
Service Account: default
Node: master1/10.17.30.131
Start Time: Tue, 07 Apr 2026 09:45:19 +0800
Labels: app.kubernetes.io/instance=infernex-vllm-pd-2p1d-01
app.kubernetes.io/name=inference-backend
openfuyao.com/dpSize=1
openfuyao.com/engine=vllm
openfuyao.com/model=qwen-qwen3-8b
openfuyao.com/pdGroupID=qwen3-8b-pd-01
openfuyao.com/pdRole=decode
openfuyao.com/ppSize=1
openfuyao.com/tpSize=1
pod-template-hash=6cd64bc69c
Annotations: checksum/config: 476b32f01fc96ff2896aee7fce288cd2b58cdb2ac825d1a22518798806847a2c
cni.projectcalico.org/containerID: 32b3384131b69054ca45acc8afe5e272b0ca681ea6d0611b3fec7316e3532e80
cni.projectcalico.org/podIP: 192.168.137.155/32
cni.projectcalico.org/podIPs: 192.168.137.155/32
huawei.com/AscendReal: Ascend310P-0
huawei.com/kltDev: Ascend310P-0
Status: Pending
IP: 192.168.137.155
IPs:
IP: 192.168.137.155
Controlled By: ReplicaSet/vllm-pd-2p1d-01-decode-6cd64bc69c
Init Containers:
mooncake-config-init:
Container ID: containerd://4ede488dd17e33f3a980aee6fa4eac3093ad6366c3854fe85436f81f6e1df7bb
Image: hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
Image ID: hub.oepkgs.net/openfuyao/mikefarah/yq@sha256:4facc66fdcc785ec961ef7f2185f53f862f462eefe1d50c2eb311c2bb26823e3
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
Args:
set -e
CONFIG_PATH="/app/mooncake.json"
mkdir -p "$(dirname "$CONFIG_PATH")"
cat > /tmp/mooncake_config.tpl << 'EOF'
local_hostname: "$POD_IP"
metadata_server: "redis://redis-service:6379"
master_server_address: "mooncake-master-service:30089"
device_name: ""
protocol: "ascend"
global_segment_size: 42949672960
use_ascend_direct: true
EOF
POD_IP_VALUE="${POD_IP:-0.0.0.0}"
sed "s/\$POD_IP/${POD_IP_VALUE}/g" /tmp/mooncake_config.tpl | yq eval - -o=json > "$CONFIG_PATH"
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 07 Apr 2026 09:45:20 +0800
Finished: Tue, 07 Apr 2026 09:45:20 +0800
Ready: True
Restart Count: 0
Environment:
POD_NAME: vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
POD_IP: (v1:status.podIP)
Mounts:
/app from mooncake-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
huggingface-download:
Container ID: containerd://84e15b7d9e2f7382181e309c1558174ec48e58ad0ae14f92ae0dfff284da76e5
Image: cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
Image ID: cr.openfuyao.cn/openfuyao/huggingface-download@sha256:ac86348b5e6934a020c21c4f0ebf81b520194ba8e549f1847ecc7521b82d9a8d
Port: <none>
Host Port: <none>
Command:
hf
download
Qwen/Qwen3-8B
State: Running
Started: Tue, 07 Apr 2026 09:47:33 +0800
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 07 Apr 2026 09:45:20 +0800
Finished: Tue, 07 Apr 2026 09:47:32 +0800
Ready: False
Restart Count: 1
Environment:
HF_HUB_OFFLINE: 0
VLLM_USE_V1: 1
GLOO_SOCKET_IFNAME: eth0
TP_SOCKET_IFNAME: eth0
HCCL_SOCKET_IFNAME: eth0
MOONCAKE_CONFIG_PATH: /app/mooncake.json
Mounts:
/root/.cache from rootcache (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Containers:
decode-engine:
Container ID:
Image: hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
Image ID:
Port: 8000/TCP (decode-port)
Host Port: 0/TCP (decode-port)
Command:
/bin/bash
-c
Args:
# PHYSICAL_DEVICES stands for the physical devices assigned to the container, use for vllm ascend 0.10.x
export PHYSICAL_DEVICES=$(ls /dev/davinci* 2>/dev/null | grep -o '[0-9]\+' | sort -n | paste -sd',' -)
# start vllm service
vllm serve Qwen/Qwen3-8B \
--served-model-name Qwen/Qwen3-8B \
--trust-remote-code \
--no-enable-prefix-caching \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 10000 \
--max-num-batched-tokens 40960 \
--data-parallel-size 1 \
--pipeline-parallel-size 1 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config '{"engine_id":"'$POD_NAME'","kv_connector":"MultiConnector","kv_connector_extra_config":{"connectors":[{"kv_buffer_device":"npu","kv_connector":"MooncakeConnectorV1","kv_connector_extra_config":{"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2},"use_ascend_direct":true},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"},{"kv_buffer_device":"npu","kv_connector":"AscendStoreConnector","kv_connector_extra_config":{"backend":"mooncake","decode":{"dp_size":1,"tp_size":1},"lookup_rpc_port":"0","prefill":{"dp_size":1,"tp_size":2}},"kv_parallel_size":1,"kv_port":"20001","kv_role":"kv_consumer"}],"decode":{"dp_size":1,"tp_size":1},"prefill":{"dp_size":1,"tp_size":2}},"kv_port":"20001","kv_rank":1,"kv_role":"kv_consumer"}'
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 8
huawei.com/Ascend310P: 1
memory: 64Gi
Requests:
cpu: 4
huawei.com/Ascend310P: 1
memory: 32Gi
Liveness: http-get http://:decode-port/health delay=0s timeout=10s period=10s #success=1 #failure=3
Readiness: http-get http://:decode-port/v1/models delay=0s timeout=5s period=10s #success=1 #failure=3
Startup: http-get http://:decode-port/v1/models delay=30s timeout=5s period=30s #success=1 #failure=60
Environment:
HF_HUB_OFFLINE: 0
VLLM_USE_V1: 1
GLOO_SOCKET_IFNAME: eth0
TP_SOCKET_IFNAME: eth0
HCCL_SOCKET_IFNAME: eth0
MOONCAKE_CONFIG_PATH: /app/mooncake.json
POD_NAME: vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp (v1:metadata.name)
POD_IP: (v1:status.podIP)
Mounts:
/app from mooncake-config (ro)
/dev/shm from shm (rw)
/etc/ascend_install.info from installinfo (rw)
/etc/hccn.conf from hccnconf (rw)
/root/.cache from rootcache (rw)
/usr/bin/hccn_tool from hccntool (rw)
/usr/local/Ascend/driver/lib64 from lib64 (rw)
/usr/local/Ascend/driver/version.info from version (rw)
/usr/local/bin/npu-smi from npusmi (rw)
/usr/local/dcmi from dcmi (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jvbd (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
mooncake-config:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
shm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 24Gi
dcmi:
Type: HostPath (bare host directory volume)
Path: /usr/local/dcmi
HostPathType:
npusmi:
Type: HostPath (bare host directory volume)
Path: /usr/local/bin/npu-smi
HostPathType: File
lib64:
Type: HostPath (bare host directory volume)
Path: /usr/local/Ascend/driver/lib64
HostPathType:
version:
Type: HostPath (bare host directory volume)
Path: /usr/local/Ascend/driver/version.info
HostPathType: File
installinfo:
Type: HostPath (bare host directory volume)
Path: /etc/ascend_install.info
HostPathType: File
hccntool:
Type: HostPath (bare host directory volume)
Path: /usr/bin/hccn_tool
HostPathType: File
hccnconf:
Type: HostPath (bare host directory volume)
Path: /etc/hccn.conf
HostPathType: File
rootcache:
Type: HostPath (bare host directory volume)
Path: /home/llm_cache
HostPathType:
kube-api-access-9jvbd:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m20s default-scheduler 0/1 nodes are available: 1 Insufficient huawei.com/Ascend310P. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Normal Scheduled 2m15s default-scheduler Successfully assigned ai-inference/vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp to master1
Normal Pulled 2m14s kubelet Container image "hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1" already present on machine
Normal Created 2m14s kubelet Created container: mooncake-config-init
Normal Started 2m14s kubelet Started container mooncake-config-init
Normal Pulled 1s (x2 over 2m14s) kubelet Container image "cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2" already present on machine
Normal Created 1s (x2 over 2m14s) kubelet Created container: huggingface-download
Normal Started 1s (x2 over 2m14s) kubelet Started container huggingface-download
Check the error logs:
# logs from the current attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download
# logs from the previous failed attempt
kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous
[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download
[root@master1 ~]#
[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-6cd64bc69c-rq9tp -c huggingface-download --previous
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
yield
File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 250, in handle_request
resp = self._pool.handle_request(req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
raise exc from None
File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
response = connection.handle_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
raise exc
File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 78, in handle_request
stream = self._connect(request)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpcore/_sync/connection.py", line 124, in _connect
stream = self._network_backend.connect_tcp(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpcore/_backends/sync.py", line 207, in connect_tcp
with map_exceptions(exc_map):
File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.11/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
raise to_exc(exc) from exc
httpcore.ConnectError: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 240, in snapshot_download
repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3285, in repo_info
return method(
^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 3020, in model_info
r = get_session().get(path, headers=headers, timeout=timeout, params=params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1053, in get
return self.request(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
return self.send(request, auth=auth, follow_redirects=follow_redirects)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
response = self._send_handling_auth(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
response = self._send_handling_redirects(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
response = self._send_single_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_client.py", line 1014, in _send_single_request
response = transport.handle_request(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 249, in handle_request
with map_httpcore_exceptions():
File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
self.gen.throw(typ, value, traceback)
File "/usr/local/lib/python3.11/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/hf", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/hf.py", line 113, in main
app()
File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1152, in __call__
raise e
File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1135, in __call__
return get_command(self)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 795, in main
return _main(
^^^^^^
File "/usr/local/lib/python3.11/site-packages/typer/core.py", line 188, in _main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/click/core.py", line 824, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/typer/main.py", line 1514, in wrapper
return callback(**use_params)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 224, in download
_print_result(run_download())
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/cli/download.py", line 185, in run_download
return snapshot_download(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 89, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 324, in snapshot_download
raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Got: ConnectError: [Errno 101] Network is unreachable
An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.
This confirms a network problem: the node cannot reach Hugging Face (Network is unreachable), and there is no local cache to fall back on.
Solution: switch to a domestic mirror (recommended).
Add an environment variable to the huggingface-download init container in the Deployment:
env:
- name: HF_ENDPOINT
value: "https://hf-mirror.com"
It is best to add it to decode-engine as well; otherwise that container fails with the same error.
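The env var can also be patched in from the command line rather than editing the Deployment YAML by hand. A sketch with `kubectl patch` (strategic merge patches match containers and initContainers by name, so only the entries being changed need to appear; the Deployment and container names are taken from the describe output above):

```shell
# Add HF_ENDPOINT to both the model-download init container and the
# main engine container; the Deployment then rolls out new pods.
kubectl -n ai-inference patch deployment vllm-pd-2p1d-01-decode \
  --type strategic -p '
spec:
  template:
    spec:
      initContainers:
      - name: huggingface-download
        env:
        - name: HF_ENDPOINT
          value: "https://hf-mirror.com"
      containers:
      - name: decode-engine
        env:
        - name: HF_ENDPOINT
          value: "https://hf-mirror.com"'
```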
Tailing the logs may show no output at all:
kubectl -n ai-inference logs deployments/vllm-pd-2p1d-01-decode huggingface-download -f
In that case, just watch the size of the model cache directory, which keeps growing:
$ watch -n 2 -d 'du -sh /home/llm_cache/'
426M /home/llm_cache/

## 310P runtime error
[root@master1 ~]# kubectl -n ai-inference logs vllm-pd-2p1d-01-decode-7d487c49cd-qw89v
Defaulted container "decode-engine" out of: decode-engine, mooncake-config-init (init), huggingface-download (init)
...
INFO 04-07 05:15:19 [__init__.py:217] Platform plugin ascend is activated
(EngineCore_DP0 pid=94) INFO 04-07 05:15:33 [ascend_config.py:55] Linear layer sharding enabled with config: None. Note: This feature works optimally with FLASHCOMM2 and DSA-CP enabled; using it without these features may result in significant performance degradation.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] EngineCore failed to start.
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] Traceback (most recent call last):
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] super().__init__(
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] self._init_executor()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] check_ascend_device_type()
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) ERROR 04-07 05:15:33 [patch_core.py:68] AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.
(EngineCore_DP0 pid=94) Process EngineCore_DP0:
(EngineCore_DP0 pid=94) Traceback (most recent call last):
(EngineCore_DP0 pid=94) File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=94) self.run()
(EngineCore_DP0 pid=94) File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=94) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 72, in run_engine_core
(EngineCore_DP0 pid=94) raise e
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/platform/patch_core.py", line 59, in run_engine_core
(EngineCore_DP0 pid=94) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=94) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=94) super().__init__(
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=94) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=94) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=94) self._init_executor()
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=94) self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm/vllm/v1/worker/worker_base.py", line 313, in init_worker
(EngineCore_DP0 pid=94) self.worker = worker_class(**kwargs)
(EngineCore_DP0 pid=94) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 116, in __init__
(EngineCore_DP0 pid=94) check_ascend_device_type()
(EngineCore_DP0 pid=94) File "/vllm-workspace/vllm-ascend/vllm_ascend/utils.py", line 708, in check_ascend_device_type
(EngineCore_DP0 pid=94) assert _ascend_device_type == cur_device_type, f"Current device type: {cur_device_type} does not match the installed version's device type: {_ascend_device_type}, please check your installation package."
(EngineCore_DP0 pid=94) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=94) AssertionError: Current device type: AscendDeviceType._310P does not match the installed version's device type: AscendDeviceType.A2, please check your installation package.
Investigation showed that the hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 image is built for the 910, and the official documentation confirms this.
The 310P should use an image with a 310p suffix. After searching the image registry, I tried replacing it with quay.io/ascend/vllm-ascend:v0.18.0rc1-310p-openeuler.
Comparing sha256 digests shows that hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0 is an exact mirror of quay.io/ascend/vllm-ascend:v0.13.0; the digests are identical.
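For reference, a digest comparison like this can be done without pulling either image, for example with skopeo (an assumption that skopeo is installed; crane or `docker buildx imagetools inspect` work similarly):

```shell
# Print each image's manifest digest; equal digests mean the two
# references point at byte-for-byte identical images.
skopeo inspect --format '{{.Digest}}' \
  docker://hub.oepkgs.net/openfuyao/ascend/vllm-ascend:v0.13.0
skopeo inspect --format '{{.Digest}}' \
  docker://quay.io/ascend/vllm-ascend:v0.13.0
```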
## 310P vllm-ascend error

However, after switching to another image with the 310p suffix, it errors out at startup:
Every 1.0s: kubectl -n ai-inference get pod master1: Thu Apr 9 08:36:06 2026
NAME READY STATUS RESTARTS AGE
cache-indexer-deployment-65d5b449f6-x9l46 1/1 Running 0 17h
inference-gateway-istio-5f9b7d78f6-7kbrw 1/1 Running 26 (15h ago) 17h
infernex-epp-5cc456bd-4vvmv 1/1 Running 0 17h
mooncake-master-deployment-74cc5666b7-fr4fq 1/1 Running 0 17h
redis-server-deployment-67566b9765-m66lc 1/1 Running 0 17h
vllm-pd-2p1d-01-decode-7687ccb7b-vg98n 0/1 CrashLoopBackOff 161 (31s ago) 16h
vllm-pd-2p1d-01-prefill-66f7564d7f-tdd84 0/1 Pending 0 43h
vllm-pd-2p1d-01-prefill-fd68f87cf-mhtdk 0/1 Pending 0 40h
vllm-pd-2p1d-01-proxy-7ff4f59865-h8xbw 1/1 Running 0 17h
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(APIServer pid=1) File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/ascend_multi_connector.py", line 5, in <module>
(APIServer pid=1) from vllm_ascend.distributed.kv_transfer.kv_p2p.mooncake_layerwise_connector import MooncakeLayerwiseConnector
(APIServer pid=1) File "/vllm-workspace/vllm-ascend/vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py", line 25, in <module>
(APIServer pid=1) from mooncake.engine import TransferEngine # type: ignore
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ModuleNotFoundError: No module named 'mooncake'
(APIServer pid=1) [ERROR] 2026-04-08-08:58:03 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
[root@master1 ~]# kubectl -n ai-inference describe pod vllm-pd-2p1d-01-decode-7687ccb7b-vg98n | grep -i image:
Image: hub.oepkgs.net/openfuyao/mikefarah/yq:4.50.1
Image: cr.openfuyao.cn/openfuyao/huggingface-download:0.22.2
Image: quay.io/ascend/vllm-ascend:main-310p
For this problem, openFuyao gave the following advice:
@tl.s Troubleshooting the InferNex deployment on a 310P environment:
The vllm-ascend 310P image does not bundle mooncake, so KV cache transfer between prefill and decode cannot be supported.
https://github.com/vllm-project/vllm-ascend/blob/main/Dockerfile.310p
We recommend deploying in aggregated mode; see the InferNex aggregated-mode example. When deploying, delete the `inference-backend.services[0].kvTransferConfig` entry so that no mooncake-related capability is used:
https://gitcode.com/openFuyao/InferNex/blob/0.22.2/examples/vllm-aggregated-random-values.yaml
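Concretely, the advice amounts to deleting one block from the example values file before installing. A minimal sketch of the relevant section; only the `kvTransferConfig` key path comes from the advice above, while the surrounding layout is assumed from the example file and may differ:

```yaml
# Sketch only: surrounding structure is assumed, not copied from the example file.
inference-backend:
  services:
    - engine: vllm              # ...all other service fields stay unchanged...
      # kvTransferConfig:       # delete this whole block to disable mooncake
      #   ...
```

With `kvTransferConfig` absent, the backend runs as a single aggregated instance instead of a prefill/decode pair.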
vllm-ascend documentation for online inference on the 310P:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu
Inference on the 310P has not been verified yet; you can try v0.13.0 or v0.18.0rc1, both of which have official vLLM documentation support.
Wrong card count in aggregated mode
There is only one 310P card in this environment, but the defaults request two. Two changes are needed: set the requested resource count to 1, and set the vLLM launch argument tensor_parallel_size to 1.
resources:
limits:
cpu: "8"
huawei.com/Ascend310P: "1"
memory: 64Gi
requests:
cpu: "4"
huawei.com/Ascend310P: "1"
memory: 32Gi
...
# start vllm service
vllm serve Qwen/Qwen3-8B \
--served-model-name Qwen/Qwen3-8B \
--trust-remote-code \
--enable-prefix-caching \
--port 8000 \
--tensor-parallel-size 1 \
bf16 dtype error
(EngineCore pid=35) [PID: 35] 2026-04-09-08:27:33.217.113 AclNN_Parameter_Error(EZ1001): Tensor self not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT,DT_FLOAT16,DT_INT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT8,DT_BOOL,DT_DOUBLE,].
(EngineCore pid=35)
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1) return runner.run(wrapper())
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:27:49 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to an AI-generated explanation, not yet verified: the Ascend 310P chip does not support the bfloat16 (bf16) dtype, but vLLM uses torch.ones(..., dtype=torch.bfloat16) when initializing the rotary embedding (RoPE), which makes the ACL operator fail.
Among floating-point types, the 310P operator library supports only float32 and float16; bf16 is not included.
Forcing float16 via the --dtype half argument solves it:
# --dtype half forces float16 instead of bfloat16
vllm serve Qwen/Qwen3-8B \
  --dtype half \
  ...other args
npu_dynamic_quant operator error
After fixing the issues above, the pod runs for longer:
(EngineCore pid=35) INFO 04-09 08:43:11 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen3-8B: 1.077223 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:08<00:32, 8.07s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:16<00:24, 8.33s/it]
But in the end it still errors out; the trace is as follows:
(EngineCore pid=35) INFO 04-09 08:47:55 [default_loader.py:384] Loading weights took 35.12 seconds
(EngineCore pid=35) INFO 04-09 08:47:57 [model_runner_v1.py:2589] Loading model weights took 17.6043 GB
.(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/4de24ceb58/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=35) INFO 04-09 08:48:11 [backends.py:1048] Dynamo bytecode transform time: 13.07 s
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] super().__init__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] super().profile_run()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return super()._dummy_run(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] hidden_states = self.model(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self._call_with_optional_nvtx_range(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return callable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] raise BackendCompilerFailed(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] self.configure_post_pass()
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwds)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return make_fx_tracer.trace(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self._trace_inner(f, *args)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] t = dispatch_trace(
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return disable_fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] graph = tracer.trace(root, concrete_args) # type: ignore[arg-type]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] (self.create_arg(fn(*args)),),
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] out = f(*tensors) # type:ignore[call-arg]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return fn(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] out = func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] TraceBack (most recent call last):
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Tiling failed
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Tiling Failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] DynamicQuant launch kernel failed.
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35) ERROR 04-09 08:48:12 [core.py:1099]
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35) self.run()
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35) self._target(*self._args, **self._kwargs)
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35) raise e
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) super().__init__(
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) self.model_runner.profile_run()
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) super().profile_run()
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) return super()._dummy_run(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) outputs = self._model_forward(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) hidden_states = self.model(
(EngineCore pid=35) ^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) return forward_call(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/models/qwen3.py", line 322, in forward
(EngineCore pid=35) hidden_states = self.model(
(EngineCore pid=35) ^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 597, in __call__
(EngineCore pid=35) output = TorchCompileWithNoGuardsWrapper.__call__(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 182, in __call__
(EngineCore pid=35) return self._call_with_optional_nvtx_range(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=35) return callable_fn(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 845, in compile_wrapper
(EngineCore pid=35) raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2196, in _call_user_compiler
(EngineCore pid=35) raise BackendCompilerFailed(
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 2171, in _call_user_compiler
(EngineCore pid=35) compiled_fn = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
(EngineCore pid=35) compiled_gm = compiler_fn(gm, example_inputs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/__init__.py", line 2437, in __call__
(EngineCore pid=35) return self.compiler_fn(model_, inputs_, **self.kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) return func(*args, **kwds)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 1063, in __call__
(EngineCore pid=35) self.configure_post_pass()
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/compilation/backends.py", line 847, in configure_post_pass
(EngineCore pid=35) self.pass_manager.configure(self.vllm_config)
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/graph_fusion_pass_manager.py", line 55, in configure
(EngineCore pid=35) self.passes.append(AddRMSNormQuantFusionPass(config))
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 493, in __init__
(EngineCore pid=35) AddRMSNormDynamicQuantPattern(vllm_config, eps=eps).register(self.pattern_match_passes)
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/base_pattern.py", line 49, in register
(EngineCore pid=35) pm.register_replacement(pattern_fn, replacement_fn, example_inputs, pm.fwd_only, pm_pass)
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1552, in register_replacement
(EngineCore pid=35) pattern, gm = gen_pattern_and_search_gm(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 81, in inner
(EngineCore pid=35) return func(*args, **kwds)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 1760, in gen_pattern_and_search_gm
(EngineCore pid=35) search_gm = trace_fn(search_fn, flat_inputs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_inductor/pattern_matcher.py", line 2115, in fwd_only
(EngineCore pid=35) gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2429, in wrapped
(EngineCore pid=35) return make_fx_tracer.trace(f, *args)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2356, in trace
(EngineCore pid=35) return self._trace_inner(f, *args)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 2318, in _trace_inner
(EngineCore pid=35) t = dispatch_trace(
(EngineCore pid=35) ^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_compile.py", line 53, in inner
(EngineCore pid=35) return disable_fn(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) return fn(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1303, in dispatch_trace
(EngineCore pid=35) graph = tracer.trace(root, concrete_args) # type: ignore[arg-type]
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=35) return fn(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/_symbolic_trace.py", line 868, in trace
(EngineCore pid=35) (self.create_arg(fn(*args)),),
(EngineCore pid=35) ^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1361, in wrapped
(EngineCore pid=35) out = f(*tensors) # type:ignore[call-arg]
(EngineCore pid=35) ^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/compilation/passes/norm_quant_fusion_pass.py", line 300, in pattern
(EngineCore pid=35) quantized_output = torch.ops.npu.npu_dynamic_quant(out0)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) return self._op(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1409, in __torch_function__
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) return self._op(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_stats.py", line 28, in wrapper
(EngineCore pid=35) return fn(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 1534, in __torch_dispatch__
(EngineCore pid=35) return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 994, in proxy_call
(EngineCore pid=35) out = func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 841, in __call__
(EngineCore pid=35) return self._op(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0xfffec83c17d0>' raised:
(EngineCore pid=35) RuntimeError: npu_dynamic_quant:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:82 NPU function error: call aclnnDynamicQuantV2 failed, error code is 561103
(EngineCore pid=35) [ERROR] 2026-04-09-08:48:12 (PID:35, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(EngineCore pid=35) [PID: 35] 2026-04-09-08:48:12.623.243 AclNN_Parameter_Error(EZ1001): DynamicQuant launch kernel failed.
(EngineCore pid=35) TraceBack (most recent call last):
(EngineCore pid=35) Tiling failed
(EngineCore pid=35) Tiling Failed.
(EngineCore pid=35) Kernel GetWorkspace failed. opType: 21
(EngineCore pid=35) DynamicQuant launch kernel failed.
(EngineCore pid=35)
(EngineCore pid=35)
(EngineCore pid=35) Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(EngineCore pid=35)
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1) return runner.run(wrapper())
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-09-08:48:30 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to the AI's analysis, the root cause is that the norm_quant fusion pass issues the npu_dynamic_quant operator to the 310P at compile time, but the 310P does not support this dynamic-quantization operator (or the current CANN version is incompatible with it), so tiling fails.
The openFuyao community responded as follows:

@tl.s
Looking at the error log, the 310 card does not support the DynamicQuantV2 operator that vllm-ascend enables by default. Try adding the startup options `--enforce-eager` and `--no-quant`.
In InferNex, vLLM startup arguments that are not directly exposed can be added under `inference-backend.services[0].pd.prefill/decode.extraArgs`. For example:

```yaml
extraArgs:
  - "--enforce-eager"
  - "--no-quant"
```

If that still fails, try switching models and deploying one of the examples from the official 310P documentation, such as Qwen2.5-7B-Instruct.
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu
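Based on that reply, a Helm values override might be sketched as follows. Only the `inference-backend.services[0].pd.prefill/decode.extraArgs` path comes from the community answer; the surrounding key names and file name are assumptions and should be checked against the chart's actual `values.yaml` before use.

```yaml
# values-310p.yaml — hypothetical override sketch; verify the key layout
# against the InferNex chart's values.yaml before applying.
inference-backend:
  services:
    - pd:
        prefill:
          extraArgs:
            - "--enforce-eager"   # disable torch.compile graph mode on 310P
            - "--no-quant"        # skip quantization ops (DynamicQuantV2) unsupported on 310P
        decode:
          extraArgs:
            - "--enforce-eager"
            - "--no-quant"
```

It could then be applied with something like `helm upgrade <release> <chart> -f values-310p.yaml` (release and chart names depend on how InferNex was installed).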
OOM
[root@master1 fuyao-26.3-rc3]# kubectl -n ai-inference logs deployments/vllm-pd-2p1d-01 -f
Defaulted container "aggregated-engine" out of: aggregated-engine, huggingface-download (init)
INFO 04-13 06:21:29 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:21:29 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:21:29 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:21:29 [__init__.py:239] Platform plugin ascend is activated
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
INFO 04-13 06:21:42 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
WARNING 04-13 06:21:44 [__init__.py:80] The quantization method 'ascend' already exists and will be overwritten by the quantization config <class 'vllm_ascend._310p.quantization.modelslim_config.AscendModelSlimConfig310'>.
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.0
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] █▄█▀ █ █ █ █ model Qwen/Qwen2.5-7B-Instruct
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:297]
(APIServer pid=1) INFO 04-13 06:21:45 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-7B-Instruct', 'model': 'Qwen/Qwen2.5-7B-Instruct', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 4096, 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen2.5-7B-Instruct'], 'block_size': 128, 'gpu_memory_utilization': 0.8, 'enable_prefix_caching': True, 'max_num_batched_tokens': 40960, 'kv_events_config': KVEventsConfig(enable_kv_cache_events=True, publisher='zmq', endpoint='tcp://*:5557', replay_endpoint=None, buffer_steps=10000, hwm=100000, max_queue_size=100000, topic='kv-events')}
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_ADDR
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT_HTTP_API
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_PORT
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_PD_2P1D_01_SERVICE_HOST
(APIServer pid=1) WARNING 04-13 06:21:45 [envs.py:1717] Unknown vLLM environment variable detected: VLLM_USE_V1
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:533] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1) WARNING 04-13 06:22:18 [model.py:1920] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 04-13 06:22:18 [model.py:1582] Using max model len 4096
(APIServer pid=1) INFO 04-13 06:22:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=40960.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:788] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 04-13 06:22:18 [vllm.py:799] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 04-13 06:22:18 [vllm.py:964] Cudagraph is disabled under eager mode
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:749] Parameter '--disable-cascade-attn' is a GPU-specific feature. Resetting to False for Ascend.
(APIServer pid=1) WARNING 04-13 06:22:26 [platform.py:838] Ignored parameter 'disable_flashinfer_prefill'. This is a GPU-specific feature not supported on Ascend. Resetting to False.
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:425] Dynamic EPLB is False
(APIServer pid=1) INFO 04-13 06:22:26 [ascend_config.py:426] The number of redundant experts is 0
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:297] Compilation disabled, using eager mode by default
(APIServer pid=1) INFO 04-13 06:22:26 [platform.py:502] Set PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
(APIServer pid=1) INFO 04-13 06:22:26 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant
INFO 04-13 06:22:53 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:22:53 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:22:53 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:22:53 [__init__.py:239] Platform plugin ascend is activated
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.netloader.netloader.ModelNetLoaderElastic'>` with load format `netloader`
(EngineCore pid=35) INFO 04-13 06:23:03 [__init__.py:110] Registered model loader `<class 'vllm_ascend.model_loader.rfork.rfork_loader.RForkModelLoader'>` with load format `rfork`
(EngineCore pid=35) INFO 04-13 06:23:03 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='Qwen/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'vllm_ascend.compilation.compiler_interface.AscendCompiler', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [40960], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=35) WARNING 04-13 06:23:08 [camem.py:66] Failed to import vllm_ascend_C:/vllm-workspace/vllm-ascend/vllm_ascend/vllm_ascend_C.cpython-311-aarch64-linux-gnu.so: undefined symbol: _ZN9pp_matmul17GetPpMatmulTilingERKNS_10MatMulInfoERKNS_12HardwareInfoERjRNS_18PpMatmulTilingDataE. Sleep mode will be disabled.
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:425] Dynamic EPLB is False
(EngineCore pid=35) INFO 04-13 06:23:08 [ascend_config.py:426] The number of redundant experts is 0
INFO 04-13 06:23:22 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 04-13 06:23:22 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 04-13 06:23:22 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 04-13 06:23:22 [__init__.py:239] Platform plugin ascend is activated
....(EngineCore pid=35) INFO 04-13 06:24:35 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.137.164:42089 backend=hccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) INFO 04-13 06:24:36 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore pid=35) WARNING 04-13 06:24:36 [worker.py:306] Bind cpus failed in rank0: Can not get running npu info. Skip binding cpu.
(EngineCore pid=35) INFO 04-13 06:24:37 [model_runner_v1.py:2562] Starting to load model Qwen/Qwen2.5-7B-Instruct...
(EngineCore pid=35) INFO 04-13 06:24:56 [weight_utils.py:574] Time spent downloading weights for Qwen/Qwen2.5-7B-Instruct: 4.228922 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:10<00:30, 10.12s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:20<00:20, 10.32s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:30<00:10, 10.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00, 9.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:40<00:00, 10.08s/it]
(EngineCore pid=35)
(EngineCore pid=35) INFO 04-13 06:25:46 [default_loader.py:384] Loading weights took 40.53 seconds
(EngineCore pid=35) INFO 04-13 06:25:48 [model_runner_v1.py:2589] Loading model weights took 16.2391 GB
.(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] EngineCore failed to start.
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] super().__init__(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] self.model_runner.profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] super().profile_run()
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return super()._dummy_run(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] outputs = self._model_forward(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] hidden_states = self.model(
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self.forward(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] hidden_states = self.mlp(hidden_states)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return super().forward(input_)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return self._op(*args, **kwargs)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) ERROR 04-13 06:25:51 [core.py:1099] RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(EngineCore pid=35) Process EngineCore:
(EngineCore pid=35) Traceback (most recent call last):
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=35) self.run()
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore pid=35) self._target(*self._args, **self._kwargs)
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=35) raise e
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=35) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=35) super().__init__(
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=35) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 245, in _initialize_kv_caches
(EngineCore pid=35) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=35) return self.collective_rpc("determine_available_memory")
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=35) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/worker_310p.py", line 69, in determine_available_memory
(EngineCore pid=35) self.model_runner.profile_run()
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2550, in profile_run
(EngineCore pid=35) super().profile_run()
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 5516, in profile_run
(EngineCore pid=35) hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/_310p/model_runner_310p.py", line 170, in _dummy_run
(EngineCore pid=35) return super()._dummy_run(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=35) return func(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2492, in _dummy_run
(EngineCore pid=35) outputs = self._model_forward(
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1818, in _model_forward
(EngineCore pid=35) hidden_states = self.model(
(EngineCore pid=35) ^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) return forward_call(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 583, in forward
(EngineCore pid=35) hidden_states = self.model(
(EngineCore pid=35) ^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/compilation/decorators.py", line 439, in __call__
(EngineCore pid=35) return self.forward(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 444, in forward
(EngineCore pid=35) hidden_states, residual = layer(positions, hidden_states, residual)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) return forward_call(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 311, in forward
(EngineCore pid=35) hidden_states = self.mlp(hidden_states)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) return forward_call(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/models/qwen2.py", line 114, in forward
(EngineCore pid=35) gate_up, _ = self.gate_up_proj(x)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=35) return self._call_impl(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=35) return forward_call(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/linear.py", line 215, in forward
(EngineCore pid=35) return super().forward(input_)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 582, in forward
(EngineCore pid=35) output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/layers/linear.py", line 228, in apply
(EngineCore pid=35) return dispatch_unquantized_gemm()(layer, x, layer.weight, bias)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 55, in default_unquantized_gemm
(EngineCore pid=35) return torch.ops.vllm.unquantized_gemm(x, weight, bias)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) return self._op(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm/vllm/model_executor/parameter.py", line 126, in __torch_function__
(EngineCore pid=35) return super().__torch_function__(func, types, args, kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=35) return self._op(*args, **kwargs)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) File "/vllm-workspace/vllm-ascend/vllm_ascend/patch/worker/patch_unquantized_gemm.py", line 27, in unquantized_gemm
(EngineCore pid=35) return torch.nn.functional.linear(x, weight, bias)
(EngineCore pid=35) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=35) RuntimeError: NPU out of memory. Tried to allocate 2.89 GiB (NPU 0; 21.02 GiB total capacity; 17.36 GiB already allocated; 17.36 GiB current active; 2.15 GiB free; 17.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File "/usr/local/python3.11.14/bin/vllm", line 6, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 92, in run
(APIServer pid=1) return runner.run(wrapper())
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/asyncio/runners.py", line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 656, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 670, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 103, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 144, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) File "/usr/local/python3.11.14/lib/python3.11/contextlib.py", line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(APIServer pid=1) [ERROR] 2026-04-13-06:26:05 (PID:1, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
(APIServer pid=1) sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
According to the official vLLM Ascend docs, a single 310P can apparently only handle the 0.6B model:
Run the following script to start the vLLM server on NPU (Qwen3-0.6B:1 card, Qwen2.5-7B-Instruct:2 cards, Pangu-Pro-MoE-72B: 8 cards)
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/hardwares/310p.html#online-inference-on-npu
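A back-of-the-envelope estimate (my own, not from the InferNex or vLLM docs) makes the OOM above unsurprising: the fp16 weights of an 8B model alone consume most of a 310P's ~21 GiB, leaving too little for activations and KV cache, while a 0.6B model fits comfortably:

```python
# Rough fp16 memory estimate: 2 bytes per parameter for weights alone.
# Parameter counts and the capacity figure are assumptions taken from
# the model names and the OOM message above, not measurements.
def fp16_weight_gib(n_params: float) -> float:
    return n_params * 2 / 1024**3

NPU_TOTAL_GIB = 21.02  # "21.02 GiB total capacity" from the OOM message

for name, params in [("Qwen3-0.6B", 0.6e9), ("Qwen3-8B", 8e9)]:
    w = fp16_weight_gib(params)
    print(f"{name}: ~{w:.1f} GiB weights out of {NPU_TOTAL_GIB} GiB total")
```

This lines up with the "17.36 GiB already allocated" in the error: ~15 GiB of weights plus profiling activations, with the 2.89 GiB allocation pushing past the limit.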
and a few extra launch arguments are required:
vllm serve Qwen/Qwen3-0.6B \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--enforce-eager \
--dtype float16

Helm upgrade errors
Every helm upgrade takes several rounds before it succeeds: the Deployment's spec.selector is immutable, so once the chart changes the selector labels the old Deployment must be deleted by hand, and after that the post-upgrade hook races against the terminating scaling-system namespace:
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values
Error: UPGRADE FAILED: cannot patch "vllm-pd-2p1d-01" with kind Deployment: Deployment.apps "vllm-pd-2p1d-01" is invalid: spec.selector: Invalid value: {"matchLabels":{"app.kubernetes.io/instance":"infernex-vllm-pd-2p1d-01","app.kubernetes.io/name":"inference-backend","openfuyao.com/dpSize":"1","openfuyao.com/engine":"vllm","openfuyao.com/model":"qwen-qwen3-0.6b","openfuyao.com/pdRole":"aggregate","openfuyao.com/tpSize":"2"}}: field is immutable
[root@master1 fuyao-26.3-rc3]# kubectl -n ai-inference delete deployments.apps vllm-pd-2p1d-01
deployment.apps "vllm-pd-2p1d-01" deleted from ai-inference namespace
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values
Error: UPGRADE FAILED: post-upgrade hooks failed: warning: Hook post-upgrade infernex/charts/pd-orchestrator/charts/resourcescalinggroup/templates/webhook-wait-hook.yaml failed: 1 error occurred:
* jobs.batch "infernex-resourcescalinggroup-wait-webhook" is forbidden: unable to create new content in namespace scaling-system because it is being terminated
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values
Error: UPGRADE FAILED: failed to create resource: namespaces "scaling-system" not found
[root@master1 fuyao-26.3-rc3]# helm upgrade --install -n ai-inference infernex ./infernex -f infernex.values
Release "infernex" has been upgraded. Happy Helming!
NAME: infernex
LAST DEPLOYED: Mon Apr 13 14:33:47 2026
NAMESPACE: ai-inference
STATUS: deployed
REVISION: 27
TEST SUITE: None

Istio HTTPRoute error
[root@master1 fuyao-26.3-rc3]# kubectl -n ai-inference describe httproutes.gateway.networking.k8s.io qwen-qwen3-0.6b-httproute
Name: qwen-qwen3-0.6b-httproute
Namespace: ai-inference
Labels: app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=infernex-epp
app.kubernetes.io/version=0.21.0
Annotations: meta.helm.sh/release-name: infernex
meta.helm.sh/release-namespace: ai-inference
API Version: gateway.networking.k8s.io/v1
Kind: HTTPRoute
Metadata:
Creation Timestamp: 2026-04-13T06:31:19Z
Generation: 1
Resource Version: 3664436
UID: 21dca6e8-483a-4b03-8a78-82559c45a7e3
Spec:
Parent Refs:
Group: gateway.networking.k8s.io
Kind: Gateway
Name: inference-gateway
Rules:
Backend Refs:
Group: inference.networking.k8s.io
Kind: InferencePool
Name: qwen-qwen3-0.6b
Weight: 1
Matches:
Path:
Type: PathPrefix
Value: /
Timeouts:
Request: 300s
Status:
Parents:
Conditions:
Last Transition Time: 2026-04-13T06:31:19Z
Message: Route was valid
Observed Generation: 1
Reason: Accepted
Status: True
Type: Accepted
Last Transition Time: 2026-04-13T06:31:19Z
Message: InferencePool.Name invalid; the name of the InferencePool must be used, not the hostname.
Observed Generation: 1
Reason: InvalidDestination
Status: False
Type: ResolvedRefs
Controller Name: istio.io/gateway-controller
Parent Ref:
Group: gateway.networking.k8s.io
Kind: Gateway
Name: inference-gateway
Events: <none>
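When chasing this down, it may help to confirm that the InferencePool the route points at actually exists and that the backendRef carries its exact name. The commands below are my own suggestion, not from the InferNex docs; the CRD group comes from the Gateway API Inference Extension as shown in the backendRef above:

```shell
# List InferencePool objects in the namespace
kubectl -n ai-inference get inferencepools.inference.networking.k8s.io

# Show only the ResolvedRefs condition message of the HTTPRoute
kubectl -n ai-inference get httproute qwen-qwen3-0.6b-httproute \
  -o jsonpath='{.status.parents[0].conditions[?(@.type=="ResolvedRefs")].message}'
```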
Summary

- The Helm chart requests two 310P cards by default and must be edited by hand:

  resources:
    limits:
      cpu: "8"
      huawei.com/Ascend310P: "1"
      memory: 64Gi
    requests:
      cpu: "4"
      huawei.com/Ascend310P: "1"
      memory: 32Gi

- Hugging Face downloads need a mainland-China mirror configured by hand:

  - name: HF_ENDPOINT
    value: https://hf-mirror.com

- The vLLM launch arguments must be adjusted by hand:

  vllm serve Qwen/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --trust-remote-code \
    --enable-prefix-caching \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --enforce-eager \
    --dtype float16 \
    --max-num-batched-tokens 40960 \
    --data-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --block-size 128 \
    --kv-events-config '{"enable_kv_cache_events": true, "publisher":"zmq", "topic":"kv-events"}'

- The default Service is ClusterIP, exposed through the Istio Gateway; this path is not working yet.
- The vLLM pod keeps printing lines like:
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 192.168.137.151:51092 - "GET /metrics HTTP/1.1" 200 OK
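These lines are just the Prometheus scraper hitting vLLM's /metrics endpoint, surfaced through uvicorn's access log, so they are noise rather than a problem. When reading the logs they can simply be filtered out (a sketch, assuming the Deployment name from earlier in this post):

```shell
# Drop the periodic /metrics scrape entries from the pod logs
kubectl -n ai-inference logs deploy/vllm-pd-2p1d-01 --tail=500 | grep -v 'GET /metrics'
```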