Commit Graph

4408 Commits

Author SHA1 Message Date
Ryan 0d1a5e70bc [Graph Optimization] Add full_cuda_graph to control subgraph split (#6027) 2026-01-14 11:43:59 +08:00
Yonghua Li 456637002d [BugFix] fix cache transfer manager updating/clearing (#5930)
* [fix] fix cache transfer manager updating/clearing

* [fix] fix code style

* [fix] fix config

* [fix] fix engine client

* [fix] let worker update kv cache status signal

* [fix] update worker process

* [fix] fix clear/update for case if comm group is shutdown

* [fix] update dynamic weight manager

* [fix] fix port

* [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting
2026-01-13 05:09:29 -08:00
chenjian 6da06abc17 [Feature] Enable output caching by default (#5987)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-13 19:34:21 +08:00
MingkunZhang 3772810b0a [Metax][CI] update test_ernie_28b_vl.py image result keywords (#6022)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-13 17:15:10 +08:00
MingkunZhang 5afeef69d6 [Metax][CI] update test_ernie_28b_vl.py (#6019)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-13 15:44:43 +08:00
Jiaxin Sui becd8c3803 [XPU][CI] Update XVLLM_PATH setup in run_xpu_ci_pytest.sh (#6018)
Download and set XVLLM_PATH from output.tar.gz instead of a hardcoded path.
2026-01-13 15:42:52 +08:00
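The change described in the commit above can be sketched as a shell snippet. This is a hypothetical illustration, not the actual run_xpu_ci_pytest.sh: the archive layout, the `output/` top-level directory, and the `libxvllm.so` file name are all assumptions, and in CI the tarball would be downloaded rather than fabricated locally.

```shell
# Hypothetical sketch: derive XVLLM_PATH from an extracted output.tar.gz
# instead of a hardcoded path. Archive name and layout are assumptions.
set -e
workdir="$(mktemp -d)"
cd "${workdir}"

# In CI the archive would be downloaded; a dummy one is fabricated here
# so the sketch is self-contained and runnable.
mkdir -p output
touch output/libxvllm.so        # placeholder file, purely illustrative
tar -czf output.tar.gz output

# The substance of the change: unpack the archive and point XVLLM_PATH
# at the extracted directory instead of a fixed filesystem location.
tar -xzf output.tar.gz
export XVLLM_PATH="${workdir}/output"
echo "XVLLM_PATH=${XVLLM_PATH}"
```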
kevin cb9f952f32 [BugFix] fix metrics cache tokens (#6001)
* fix metrics cache tokens

* update code
2026-01-12 22:50:56 -08:00
bukejiyu 8061f74773 [V1 Loader] Load safetensors weights in natural key order (#6006)
* sorted safetensor

* update

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2026-01-12 21:27:20 -08:00
周周周 ad8d05a8de [Optimization] Do not compute ATTN padding part in CUDA graph mode (#5985) 2026-01-13 11:32:27 +08:00
ming1753 9c559d02d3 [BugFix] Fix insert_zmq_task_to_scheduler break bug (#5960)
* [BugFix] fix zmq bug

* fix bug

* format

* fix test bug

* fix bug
2026-01-12 19:21:01 -08:00
GoldPancake eb8ce36ae9 [BugFix] Fix entropy calculation issue in TP (#5997)
* fix entropy bugs
2026-01-13 11:10:46 +08:00
Copilot fe7588d8f0 [Docs] Update FastDeploy version to 2.3.3 in NVIDIA GPU installation documentation (#6010)
* Initial plan

* Update FastDeploy version from 2.3.2 to 2.3.3 in NVIDIA GPU installation docs

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-12 23:45:22 +08:00
YuBaoku 0d3dede273 [CI] Add fd-router build_task (#5967)
* [CI] Add fd-router build_task
2026-01-12 22:03:27 +08:00
sunxin 2533836dbb [Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel (#5880)
* qk rmsnorm fused

* inplace

* glm

* fix

* add qknorm layer

* fix

* update

* fix qwen3 vl

* update rl baseline

* fix qwen3 vl moe

* test

* fix qwen vl moe rl

* fix
2026-01-12 05:10:21 -08:00
xjkmfa 1aa7e82924 [ci case]Check the chunking of the chat interface (#5981)
* Add ci case for min token and max token

* [CI case] include total_tokens in the last packet of completion interface stream output

* [ci case] add Chunk segmentation check

* [ci case] add Chunk segmentation check

* [ci case] add Chunk segmentation check

* [ci case] add Chunk segmentation check

---------

Co-authored-by: xujing43 <xujing43@baidu.com>
2026-01-12 16:36:13 +08:00
ddchenhao66 fefc0b8382 [XPU] add ci test case for P_EP4TP4 D_EP4TP1 (#5988)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-12 16:30:15 +08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase. (#5917) 2026-01-12 14:09:39 +08:00
Yonghua Li 60ee72f682 [BugFix] [MultiAPIServer] fix rdma script and port check for multi api server (#5935)
* [fix] fix rdma script and add more error log for multi api server

* [fix] log

* [fix] fix test_multi_api_server

* [fix] fix multi api server port check

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-12 10:38:52 +08:00
sunxin 17ef3920f3 remove decoder_num_blocks_device memset (#5982) 2026-01-10 21:22:06 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
xiaoluomi 62bd92f9ba dev_fix_mtp_forward_meta (#5976) 2026-01-10 00:40:56 +08:00
zhupengyang 9db48ecb34 [XPU] fix dp4 (#5946) 2026-01-09 20:36:53 +08:00
MingkunZhang 384ffd6952 [Metax] add ci test file & update run_ci_metax.sh (#5975)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-09 18:47:06 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
CSWYF3634076 e6cdea4492 [Models] Qwen3VL and Qwen3VL-Moe CUDA graph Support (#5962)
* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support

* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support v2

* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support v3
2026-01-09 17:09:02 +08:00
zccjjj 20de04e249 [XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu (#5878) 2026-01-09 16:34:57 +08:00
Yuanle Liu d4a386dfc4 Revert "Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)" (#5972)
This reverts commit 8c3513a410.
2026-01-09 15:58:22 +08:00
Yuanle Liu 8c3513a410 Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)
This reverts commit 2bb838fed9.
2026-01-09 15:20:40 +08:00
essos 1d20957340 [CI] [Hackathon 9th Sprint No.50] Add unit tests for the fastdeploy/entrypoints/engine_client.py module - part #5045 (#5807)
* update test code

* reduce mocking

* fix style

---------

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-01-09 15:13:19 +08:00
GoldPancake 3ca99ab170 [Speculative Decoding] Return accepted tokens per head in response (#5947)
* adjust log level

* add accepted tokens per head

* fix ut

* fix
2026-01-09 14:32:08 +08:00
yangjianfengo1 16e1992eba [Bugfix] Increase the shape of w4afp8 gemm (#5957)
* increase w4afp8 shapes

* increase w4afp8 shapes

* code style
2026-01-09 14:11:17 +08:00
MingkunZhang cb09b52e66 [Metax] fix shape error & output garbled code when reasoning big picture or video (#5965)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-09 13:41:45 +08:00
kevin 2d2b156252 [BugFix] fix dyc8 cache bug (#5958)
* fix dyc8 cache bug

* update code
2026-01-08 19:25:47 -08:00
Jiaxin Sui e93a7d3b6b Lock PaddlePaddle version in run_xpu_ci_pytest.sh (#5964)
Locked PaddlePaddle version to 20260107 due to compatibility issues with the updated xhpc framework.
2026-01-09 10:41:34 +08:00
YuBaoku ff2eba1f43 [CI] Temporarily disable fp8_cases in base_tests (#5963)
* [CI] Temporarily disable fp8_cases in base_tests
2026-01-08 23:29:37 +08:00
YuBaoku 5218d40af6 [CI] Add clang-format 13.0.0 recommendation to pre_commit.sh 2026-01-08 21:47:19 +08:00
GoldPancake e41d434548 [Bugfix] Fix entropy calculation bugs (#5941)
* fix entropy bugs
2026-01-08 20:57:35 +08:00
Jiang-Jia-Jun b9663e5c89 Revise Pull Request guidelines and language section
Updated instructions for Pull Request titles and descriptions, changed language section to 'Others', and added notes on code style and pre-commit usage.
2026-01-08 19:26:05 +08:00
Copilot 6825903559 [BugFix] Fix misleading logging in worker_process for request counting (#5939)
* Initial plan

* Optimize logging in worker_process to accurately reflect request types

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Address feedback: rename to max_occupied_batch_index and simplify logging

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Improve comment clarity for batch request counting

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Fix code style: reorder imports with isort

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 16:36:22 +08:00
xiaoluomi 2bb838fed9 [TSP] last_norm allgather move to model.py (#5924)
* support_lastnorm_gather_split_dev

* support_lastnorm_gather_split_dev1

* support_lastnorm_gather_split_dev3

* support_lastnorm_gather_split_dev4

* support_lastnorm_gather_split_dev5
2026-01-07 23:36:33 -08:00
Bingoo 8e11d719f3 add flashinfer-python-paddle depend (#5912)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-01-08 15:08:35 +08:00
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927)
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
Jiaxin Sui dc170e3005 [XPU][CI]Update CI workflow to include all file types (#5943)
Removed paths-ignore for markdown and text files.
2026-01-08 12:03:26 +08:00
FocusLuo decbbb3933 [INTEL HPU] support only one release package of PaddleCustomDevice (#5910)
Signed-off-by: Luo, Focus <focus.luo@intel.com>
2026-01-08 11:57:13 +08:00
CSWYF3634076 d8fcb7c07d [Models] Add Qwen3-VL Moe Model Support (#5913)
* [Model] add Qwen3vl moe model support

* [Model] add Qwen3vl moe model support remove log

* [Model] add Qwen3vl moe model support unittest
2026-01-08 11:36:42 +08:00
Daci d8c6ba61f3 [BugFix] resource_manager_v1 lock PD (#5616)
* bugfix resource_manager_v1 lock PD

* with lock add_prefilled_request

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 10:02:54 +08:00
YuBaoku 5088d4acdb [CI] Add daily build_linux jobs for CUDA 12.9 (#5936)
Extends daily CI coverage by adding Linux build jobs for CUDA 12.9.
2026-01-07 23:20:11 +08:00
FocusLuo 64f910553e [INTEL_HPU] supported ERNIE-4.5-21B-A3B-Thinking (#5891)
ERNIE-4.5-21B-A3B-Thinking needs to use DefaultModelLoaderV1 mode

reference command line:
ENABLE_V1_KVCACHE_SCHEDULER=1 FD_ENC_DEC_BLOCK_NUM=8 HPU_PERF_BREAKDOWN_SYNC_MODE=1 \
HPU_WARMUP_BUCKET=0 MAX_PREFILL_NUM=1 FD_ATTENTION_BACKEND=HPU_ATTN \
python -m fastdeploy.entrypoints.openai.api_server --model \
./models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --max-model-len 16384 --tensor-parallel-size 1 \
--load-choices "default_v1" --num-gpu-blocks-override 5000 --kv-cache-ratio 0.5 \
--max-num-seqs 128 --block-size 64 --no-enable-prefix-caching \
--graph-optimization-config '{"use_cudagraph":false}'

Signed-off-by: Luo, Focus <focus.luo@intel.com>
2026-01-07 21:31:53 +08:00
mouxin 0a92e96f20 [Feature] Add Golang-based Router for Request Scheduling and Load Balancing (#5882)
* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-01-07 21:28:08 +08:00
chenjian 925e7edd3c [Bug fix] Limit multi-modal request to 1 (#5901) 2026-01-07 20:25:07 +08:00