Commit Graph

4408 Commits

Author SHA1 Message Date
Ryan 0d1a5e70bc [Graph Optimization] Add full_cuda_graph to control subgraph split (#6027) 2026-01-14 11:43:59 +08:00
Yonghua Li 456637002d [BugFix] fix cache transfer manager updating/clearing (#5930)
* [fix] fix cache transfer manager updating/clearing

* [fix] fix code style

* [fix] fix config

* [fix] fix engine client

* [fix] let worker update kv cache status signal

* [fix] update worker process

* [fix] fix clear/update for case if comm group is shutdown

* [fix] update dynamic weight manager

* [fix] fix port

* [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting
2026-01-13 05:09:29 -08:00
chenjian 6da06abc17 [Feature] Enable output caching by default (#5987)
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-13 19:34:21 +08:00
MingkunZhang 3772810b0a [Metax][CI] update test_ernie_28b_vl.py image result keywords (#6022)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-13 17:15:10 +08:00
MingkunZhang 5afeef69d6 [Metax][CI] update test_ernie_28b_vl.py (#6019)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-13 15:44:43 +08:00
Jiaxin Sui becd8c3803 [XPU][CI] Update XVLLM_PATH setup in run_xpu_ci_pytest.sh (#6018)
Download and set XVLLM_PATH from output.tar.gz instead of a hardcoded path.
2026-01-13 15:42:52 +08:00
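The change described in the commit above can be sketched as a shell snippet. This is a hypothetical illustration, not the actual run_xpu_ci_pytest.sh: the archive layout, the `output/` top-level directory, and the `libxvllm.so` file name are all assumptions, and in CI the tarball would be downloaded rather than fabricated locally.

```shell
# Hypothetical sketch: derive XVLLM_PATH from an extracted output.tar.gz
# instead of a hardcoded path. Archive name and layout are assumptions.
set -e
workdir="$(mktemp -d)"
cd "${workdir}"

# In CI the archive would be downloaded; a dummy one is fabricated here
# so the sketch is self-contained and runnable.
mkdir -p output
touch output/libxvllm.so        # placeholder file, purely illustrative
tar -czf output.tar.gz output

# The substance of the change: unpack the archive and point XVLLM_PATH
# at the extracted directory instead of a fixed filesystem location.
tar -xzf output.tar.gz
export XVLLM_PATH="${workdir}/output"
echo "XVLLM_PATH=${XVLLM_PATH}"
```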
kevin cb9f952f32 [BugFix] fix metrics cache tokens (#6001)
* fix metrics cache tokens

* update code
2026-01-12 22:50:56 -08:00
bukejiyu 8061f74773 [V1 Loader] Load safetensors weights in natural key order (#6006)
* sorted safetensor

* update

---------

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2026-01-12 21:27:20 -08:00
周周周 ad8d05a8de [Optimization] Do not compute ATTN padding part in CUDA graph mode (#5985) 2026-01-13 11:32:27 +08:00
ming1753 9c559d02d3 [BugFix] Fix insert_zmq_task_to_scheduler break bug (#5960)
* [BugFix] fix zmq bug

* fix bug

* format

* fix test bug

* fix bug
2026-01-12 19:21:01 -08:00
GoldPancake eb8ce36ae9 [BugFix] Fix entropy calculation issue in TP (#5997)
* fix entropy bugs
2026-01-13 11:10:46 +08:00
Copilot fe7588d8f0 [Docs] Update FastDeploy version to 2.3.3 in NVIDIA GPU installation documentation (#6010)
* Initial plan

* Update FastDeploy version from 2.3.2 to 2.3.3 in NVIDIA GPU installation docs

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-12 23:45:22 +08:00
YuBaoku 0d3dede273 [CI] Add fd-router build_task (#5967)
* [CI] Add fd-router build_task
2026-01-12 22:03:27 +08:00
sunxin 2533836dbb [Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel (#5880)
* qk rmsnorm fused

* inplace

* glm

* fix

* add qknorm layer

* fix

* update

* fix qwen3 vl

* update rl baseline

* fix qwen3 vl moe

* test

* fix qwen vl moe rl

* fix
2026-01-12 05:10:21 -08:00
xjkmfa 1aa7e82924 [ci case]Check the chunking of the chat interface (#5981)
* Add ci case for min token and max token

* [CI case] include total_tokens in the last packet of completion interface stream output

* [ci case] add Chunk segmentation check

* [ci case] add Chunk segmentation check

* [ci case] add Chunk segmentation check

* [ci case] add Chunk segmentation check

---------

Co-authored-by: xujing43 <xujing43@baidu.com>
2026-01-12 16:36:13 +08:00
ddchenhao66 fefc0b8382 [XPU] add ci test case for P_EP4TP4 D_EP4TP1 (#5988)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-12 16:30:15 +08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase. (#5917) 2026-01-12 14:09:39 +08:00
Yonghua Li 60ee72f682 [BugFix] [MultiAPIServer] fix rdma script and port check for multi api server (#5935)
* [fix] fix rdma script and add more error log for multi api server

* [fix] log

* [fix] fix test_multi_api_server

* [fix] fix multi api server port check

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-12 10:38:52 +08:00
sunxin 17ef3920f3 remove decoder_num_blocks_device memset (#5982) 2026-01-10 21:22:06 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
xiaoluomi 62bd92f9ba dev_fix_mtp_forward_meta (#5976) 2026-01-10 00:40:56 +08:00
zhupengyang 9db48ecb34 [XPU] fix dp4 (#5946) 2026-01-09 20:36:53 +08:00
MingkunZhang 384ffd6952 [Metax] add ci test file & update run_ci_metax.sh (#5975)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-09 18:47:06 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
CSWYF3634076 e6cdea4492 [Models] Qwen3VL and Qwen3VL-Moe CUDA graph Support (#5962)
* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support

* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support v2

* [Models] add Qwen3VL and Qwen3VL-Moe CUDA graph support v3
2026-01-09 17:09:02 +08:00
zccjjj 20de04e249 [XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu (#5878) 2026-01-09 16:34:57 +08:00
Yuanle Liu d4a386dfc4 Revert "Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)" (#5972)
This reverts commit 8c3513a410.
2026-01-09 15:58:22 +08:00
Yuanle Liu 8c3513a410 Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)
This reverts commit 2bb838fed9.
2026-01-09 15:20:40 +08:00
essos 1d20957340 [CI] [Hackathon 9th Sprint No.50] Add unit tests for the fastdeploy/entrypoints/engine_client.py module - part #5045 (#5807)
* update test code

* reduce mocking

* fix style

---------

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-01-09 15:13:19 +08:00
GoldPancake 3ca99ab170 [Speculative Decoding] Return accepted tokens per head in response (#5947)
* adjust log level

* add accepted tokens per head

* fix ut

* fix
2026-01-09 14:32:08 +08:00
yangjianfengo1 16e1992eba [Bugfix] Increase the shape of w4afp8 gemm (#5957)
* increase w4afp8 shapes

* increase w4afp8 shapes

* code style
2026-01-09 14:11:17 +08:00
MingkunZhang cb09b52e66 [Metax] fix shape error & output garbled code when reasoning big picture or video (#5965)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-09 13:41:45 +08:00
kevin 2d2b156252 [BugFix] fix dyc8 cache bug (#5958)
* fix dyc8 cache bug

* update code
2026-01-08 19:25:47 -08:00
Jiaxin Sui e93a7d3b6b Lock PaddlePaddle version in run_xpu_ci_pytest.sh (#5964)
Locked PaddlePaddle version to 20260107 due to compatibility issues with the updated xhpc framework.
2026-01-09 10:41:34 +08:00
YuBaoku ff2eba1f43 [CI] Temporarily disable fp8_cases in base_tests (#5963)
* [CI] Temporarily disable fp8_cases in base_tests
2026-01-08 23:29:37 +08:00
YuBaoku 5218d40af6 [CI] Add clang-format 13.0.0 recommendation to pre_commit.sh 2026-01-08 21:47:19 +08:00
GoldPancake e41d434548 [Bugfix] Fix entropy calculation bugs (#5941)
* fix entropy bugs
2026-01-08 20:57:35 +08:00
Jiang-Jia-Jun b9663e5c89 Revise Pull Request guidelines and language section
Updated instructions for Pull Request titles and descriptions, changed language section to 'Others', and added notes on code style and pre-commit usage.
2026-01-08 19:26:05 +08:00
Copilot 6825903559 [BugFix] Fix misleading logging in worker_process for request counting (#5939)
* Initial plan

* Optimize logging in worker_process to accurately reflect request types

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Address feedback: rename to max_occupied_batch_index and simplify logging

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Improve comment clarity for batch request counting

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Fix code style: reorder imports with isort

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 16:36:22 +08:00
xiaoluomi 2bb838fed9 [TSP] last_norm allgather move to model.py (#5924)
* support_lastnorm_gather_split_dev

* support_lastnorm_gather_split_dev1

* support_lastnorm_gather_split_dev3

* support_lastnorm_gather_split_dev4

* support_lastnorm_gather_split_dev5
2026-01-07 23:36:33 -08:00
Bingoo 8e11d719f3 add flashinfer-python-paddle depend (#5912)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-01-08 15:08:35 +08:00
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927)
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
Jiaxin Sui dc170e3005 [XPU][CI]Update CI workflow to include all file types (#5943)
Removed paths-ignore for markdown and text files.
2026-01-08 12:03:26 +08:00
FocusLuo decbbb3933 [INTEL HPU] support only one release package of PaddleCustomDevice (#5910)
Signed-off-by: Luo, Focus <focus.luo@intel.com>
2026-01-08 11:57:13 +08:00
CSWYF3634076 d8fcb7c07d [Models] Add Qwen3-VL Moe Model Support (#5913)
* [Model] add Qwen3vl moe model support

* [Model] add Qwen3vl moe model support remove log

* [Model] add Qwen3vl moe model support unittest
2026-01-08 11:36:42 +08:00
Daci d8c6ba61f3 [BugFix] resource_manager_v1 lock PD (#5616)
* bugfix resource_manager_v1 lock PD

* with lock add_prefilled_request

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 10:02:54 +08:00
YuBaoku 5088d4acdb [CI] Add daily build_linux jobs for CUDA 12.9 (#5936)
Extends daily CI coverage by adding Linux build jobs for CUDA 12.9.
2026-01-07 23:20:11 +08:00
FocusLuo 64f910553e [INTEL_HPU] supported ERNIE-4.5-21B-A3B-Thinking (#5891)
ERNIE-4.5-21B-A3B-Thinking needs to use DefaultModelLoaderV1 mode

reference command line:
ENABLE_V1_KVCACHE_SCHEDULER=1 FD_ENC_DEC_BLOCK_NUM=8 HPU_PERF_BREAKDOWN_SYNC_MODE=1 \
HPU_WARMUP_BUCKET=0 MAX_PREFILL_NUM=1 FD_ATTENTION_BACKEND=HPU_ATTN \
python -m fastdeploy.entrypoints.openai.api_server --model \
./models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --max-model-len 16384 --tensor-parallel-size 1 \
--load-choices "default_v1" --num-gpu-blocks-override 5000 --kv-cache-ratio 0.5 \
--max-num-seqs 128 --block-size 64 --no-enable-prefix-caching \
--graph-optimization-config '{"use_cudagraph":false}'

Signed-off-by: Luo, Focus <focus.luo@intel.com>
2026-01-07 21:31:53 +08:00
mouxin 0a92e96f20 [Feature] Add Golang-based Router for Request Scheduling and Load Balancing (#5882)
* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-01-07 21:28:08 +08:00
chenjian 925e7edd3c [Bug fix] Limit multi-modal request to 1 (#5901) 2026-01-07 20:25:07 +08:00