Ryan
0d1a5e70bc
[Graph Optimization] Add full_cuda_graph to control subgraph split ( #6027 )
2026-01-14 11:43:59 +08:00
Yonghua Li
456637002d
[BugFix] fix cache transfer manager updating/clearing ( #5930 )
...
* [fix] fix cache transfer manager updating/clearing
* [fix] fix code style
* [fix] fix config
* [fix] fix engine client
* [fix] let worker update kv cache status signal
* [fix] update worker process
* [fix] fix clear/update for case if comm group is shutdown
* [fix] update dynamic weight manager
* [fix] fix port
* [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting
2026-01-13 05:09:29 -08:00
chenjian
6da06abc17
[Feature] Enable output caching by default ( #5987 )
...
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-01-13 19:34:21 +08:00
MingkunZhang
3772810b0a
[Metax][CI] update test_ernie_28b_vl.py image result keywords ( #6022 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com >
2026-01-13 17:15:10 +08:00
MingkunZhang
5afeef69d6
[Metax][CI] update test_ernie_28b_vl.py ( #6019 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com >
2026-01-13 15:44:43 +08:00
Jiaxin Sui
becd8c3803
[XPU][CI] Update XVLLM_PATH setup in run_xpu_ci_pytest.sh ( #6018 )
...
Download and set XVLLM_PATH from output.tar.gz instead of hardcoded path.
2026-01-13 15:42:52 +08:00
kevin
cb9f952f32
[BugFix] fix metrics cache tokens ( #6001 )
...
* fix metrics cache tokens
* update code
2026-01-12 22:50:56 -08:00
bukejiyu
8061f74773
[V1 Loader] Load safetensors weights in natural key order ( #6006 )
...
* sorted safetensor
* update
---------
Co-authored-by: Yuanle Liu <yuanlehome@163.com >
2026-01-12 21:27:20 -08:00
周周周
ad8d05a8de
[Optimization] Do not compute ATTN padding part in CUDA graph mode ( #5985 )
2026-01-13 11:32:27 +08:00
ming1753
9c559d02d3
[BugFix] Fix insert_zmq_task_to_scheduler break bug ( #5960 )
...
* [BugFix] fix zmq bug
* fix bug
* format
* fix test bug
* fix bug
2026-01-12 19:21:01 -08:00
GoldPancake
eb8ce36ae9
[BugFix] Fix entropy calculation issue in TP ( #5997 )
...
* fix entropy bugs
2026-01-13 11:10:46 +08:00
Copilot
fe7588d8f0
[Docs] Update FastDeploy version to 2.3.3 in NVIDIA GPU installation documentation ( #6010 )
...
* Initial plan
* Update FastDeploy version from 2.3.2 to 2.3.3 in NVIDIA GPU installation docs
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-01-12 23:45:22 +08:00
YuBaoku
0d3dede273
[CI] Add fd-router build_task ( #5967 )
...
* [CI] Add fd-router build_task
2026-01-12 22:03:27 +08:00
sunxin
2533836dbb
[Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel ( #5880 )
...
* qk rmsnorm fused
* inplace
* glm
* fix
* add qknorm layer
* fix
* update
* fix qwen3 vl
* update rl baseline
* fix qwen3 vl moe
* test
* fix qwen vl moe rl
* fix
2026-01-12 05:10:21 -08:00
xjkmfa
1aa7e82924
[ci case] Check the chunking of the chat interface ( #5981 )
...
* Add ci case for min token and max token
* [CI case] include total_tokens in the last packet of completion interface stream output
* [ci case] add Chunk segmentation check
* [ci case] add Chunk segmentation check
* [ci case] add Chunk segmentation check
* [ci case] add Chunk segmentation check
---------
Co-authored-by: xujing43 <xujing43@baidu.com >
2026-01-12 16:36:13 +08:00
ddchenhao66
fefc0b8382
[XPU] add ci test case for P_EP4TP4 D_EP4TP1 ( #5988 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-12 16:30:15 +08:00
lzy
223b2f5d86
Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase. ( #5917 )
2026-01-12 14:09:39 +08:00
Yonghua Li
60ee72f682
[BugFix] [MultiAPIServer] fix rdma script and port check for multi api server ( #5935 )
...
* [fix] fix rdma script and add more error log for multi api server
* [fix] log
* [fix] fix test_multi_api_server
* [fix] fix multi api server port check
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-01-12 10:38:52 +08:00
sunxin
17ef3920f3
remove decoder_num_blocks_device memset ( #5982 )
2026-01-10 21:22:06 +08:00
周周周
b8d9daa785
MLA clean code ( #5979 )
2026-01-10 21:05:00 +08:00
xiaoluomi
62bd92f9ba
dev_fix_mtp_forward_meta ( #5976 )
2026-01-10 00:40:56 +08:00
zhupengyang
9db48ecb34
[XPU] fix dp4 ( #5946 )
2026-01-09 20:36:53 +08:00
MingkunZhang
384ffd6952
[Metax] add ci test file & update run_ci_metax.sh ( #5975 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com >
2026-01-09 18:47:06 +08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
CSWYF3634076
e6cdea4492
[Models] Qwen3VL and Qwen3VL-Moe CUDA graph Support ( #5962 )
...
* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support
* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support v2
* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support v3
2026-01-09 17:09:02 +08:00
zccjjj
20de04e249
[XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu ( #5878 )
2026-01-09 16:34:57 +08:00
Yuanle Liu
d4a386dfc4
Revert "Revert "[TSP] last_norm allgather move to model.py ( #5924 )" ( #5961 )" ( #5972 )
...
This reverts commit 8c3513a410 .
2026-01-09 15:58:22 +08:00
Yuanle Liu
8c3513a410
Revert "[TSP] last_norm allgather move to model.py ( #5924 )" ( #5961 )
...
This reverts commit 2bb838fed9 .
2026-01-09 15:20:40 +08:00
essos
1d20957340
[CI][Hackathon 9th Sprint No.50] NO.50 Add unit tests for fastdeploy/entrypoints/engine_client.py - part #5045 ( #5807 )
...
* update test code
* reduce mocking
* fix style
---------
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com >
2026-01-09 15:13:19 +08:00
GoldPancake
3ca99ab170
[Speculative Decoding] Return accepted tokens per head in response ( #5947 )
...
* adjust log level
* add accepted tokens per head
* fix ut
* fix
2026-01-09 14:32:08 +08:00
yangjianfengo1
16e1992eba
[Bugfix] Increase the shape of w4afp8 gemm ( #5957 )
...
* increase w4afp8 shape
* increase w4afp8 shape
* code style
2026-01-09 14:11:17 +08:00
MingkunZhang
cb09b52e66
[Metax] fix shape error & garbled output when reasoning over large images or videos ( #5965 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com >
2026-01-09 13:41:45 +08:00
kevin
2d2b156252
[BugFix] fix dyc8 cache bug ( #5958 )
...
* fix dyc8 cache bug
* update code
2026-01-08 19:25:47 -08:00
Jiaxin Sui
e93a7d3b6b
Lock PaddlePaddle version in run_xpu_ci_pytest.sh ( #5964 )
...
Locked PaddlePaddle version to 20260107 due to compatibility issues with the updated xhpc framework.
2026-01-09 10:41:34 +08:00
YuBaoku
ff2eba1f43
[CI] Temporarily disable fp8_cases in base_tests ( #5963 )
...
* [CI] Temporarily disable fp8_cases in base_tests
2026-01-08 23:29:37 +08:00
YuBaoku
5218d40af6
[CI] Add clang-format 13.0.0 recommendation to pre_commit.sh
2026-01-08 21:47:19 +08:00
GoldPancake
e41d434548
[Bugfix] Fix entropy calculation bugs ( #5941 )
...
* fix entropy bugs
2026-01-08 20:57:35 +08:00
Jiang-Jia-Jun
b9663e5c89
Revise Pull Request guidelines and language section
...
Updated instructions for Pull Request titles and descriptions, changed language section to 'Others', and added notes on code style and pre-commit usage.
2026-01-08 19:26:05 +08:00
Copilot
6825903559
[BugFix] Fix misleading logging in worker_process for request counting ( #5939 )
...
* Initial plan
* Optimize logging in worker_process to accurately reflect request types
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
* Address feedback: rename to max_occupied_batch_index and simplify logging
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
* Improve comment clarity for batch request counting
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
* Fix code style: reorder imports with isort
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-01-08 16:36:22 +08:00
xiaoluomi
2bb838fed9
[TSP] last_norm allgather move to model.py ( #5924 )
...
* support_lastnorm_gather_split_dev
* support_lastnorm_gather_split_dev1
* support_lastnorm_gather_split_dev3
* support_lastnorm_gather_split_dev4
* support_lastnorm_gather_split_dev5
2026-01-07 23:36:33 -08:00
Bingoo
8e11d719f3
add flashinfer-python-paddle dependency ( #5912 )
...
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-01-08 15:08:35 +08:00
GoldPancake
a1fc4e249e
[Bugfix] Fix mtp logprob hang problem when include stop_seq ( #5927 )
...
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
Jiaxin Sui
dc170e3005
[XPU][CI] Update CI workflow to include all file types ( #5943 )
...
Removed paths-ignore for markdown and text files.
2026-01-08 12:03:26 +08:00
FocusLuo
decbbb3933
[INTEL HPU] support only one release package of PaddleCustomDevice ( #5910 )
...
Signed-off-by: Luo, Focus <focus.luo@intel.com >
2026-01-08 11:57:13 +08:00
CSWYF3634076
d8fcb7c07d
[Models] Add Qwen3-VL Moe Model Support ( #5913 )
...
* [Model] add Qwen3vl moe model support
* [Model] add Qwen3vl moe model support remove log
* [Model] add Qwen3vl moe model support unittest
2026-01-08 11:36:42 +08:00
Daci
d8c6ba61f3
[BugFix] resource_manager_v1 lock PD ( #5616 )
...
* bugfix resource_manager_v1 lock PD
* with lock add_prefilled_request
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-01-08 10:02:54 +08:00
YuBaoku
5088d4acdb
[CI] Add daily build_linux jobs for CUDA 12.9 ( #5936 )
...
To extend the daily CI coverage by adding Linux build jobs for CUDA 12.9.
2026-01-07 23:20:11 +08:00
FocusLuo
64f910553e
[INTEL_HPU] supported ERNIE-4.5-21B-A3B-Thinking ( #5891 )
...
ERNIE-4.5-21B-A3B-Thinking needs to use DefaultModelLoaderV1 mode
reference command line:
ENABLE_V1_KVCACHE_SCHEDULER=1 FD_ENC_DEC_BLOCK_NUM=8 HPU_PERF_BREAKDOWN_SYNC_MODE=1 \
HPU_WARMUP_BUCKET=0 MAX_PREFILL_NUM=1 FD_ATTENTION_BACKEND=HPU_ATTN \
python -m fastdeploy.entrypoints.openai.api_server --model \
./models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --max-model-len 16384 --tensor-parallel-size 1 \
--load-choices "default_v1" --num-gpu-blocks-override 5000 --kv-cache-ratio 0.5 \
--max-num-seqs 128 --block-size 64 --no-enable-prefix-caching \
--graph-optimization-config '{"use_cudagraph":false}'
Signed-off-by: Luo, Focus <focus.luo@intel.com >
2026-01-07 21:31:53 +08:00
mouxin
0a92e96f20
[Feature] Add Golang-based Router for Request Scheduling and Load Balancing ( #5882 )
...
* [Feature] add golang router
* [Feature] add golang router
* [Feature] add golang router
* [Feature] add golang router
* [Feature] add golang router
* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing
* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing
* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing
* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing
---------
Co-authored-by: mouxin <mouxin@baidu.com >
2026-01-07 21:28:08 +08:00
chenjian
925e7edd3c
[Bug fix] Limit multi-modal requests to 1 ( #5901 )
2026-01-07 20:25:07 +08:00