Commit Graph

1587 Commits

Author SHA1 Message Date
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927)
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
FocusLuo decbbb3933 [INTEL HPU] support only one release package of PaddleCustomDevice (#5910)
Signed-off-by: Luo, Focus <focus.luo@intel.com>
2026-01-08 11:57:13 +08:00
CSWYF3634076 d8fcb7c07d [Models] Add Qwen3-VL Moe Model Support (#5913)
* [Model] add Qwen3vl moe model support

* [Model] add Qwen3vl moe model support remove log

* [Model] add Qwen3vl moe model support unittest
2026-01-08 11:36:42 +08:00
Daci d8c6ba61f3 [BugFix] resource_manager_v1 lock PD (#5616)
* bugfix resource_manager_v1 lock PD

* with lock add_prefilled_request

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 10:02:54 +08:00
FocusLuo 64f910553e [INTEL_HPU] supported ERNIE-4.5-21B-A3B-Thinking (#5891)
ERNIE-4.5-21B-A3B-Thinking needs to use DefaultModelLoaderV1 mode

reference command line:
ENABLE_V1_KVCACHE_SCHEDULER=1 FD_ENC_DEC_BLOCK_NUM=8 HPU_PERF_BREAKDOWN_SYNC_MODE=1 \
HPU_WARMUP_BUCKET=0 MAX_PREFILL_NUM=1 FD_ATTENTION_BACKEND=HPU_ATTN \
python -m fastdeploy.entrypoints.openai.api_server --model \
./models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --max-model-len 16384 --tensor-parallel-size 1 \
--load-choices "default_v1" --num-gpu-blocks-override 5000 --kv-cache-ratio 0.5 \
--max-num-seqs 128 --block-size 64 --no-enable-prefix-caching \
--graph-optimization-config '{"use_cudagraph":false}'

Signed-off-by: Luo, Focus <focus.luo@intel.com>
2026-01-07 21:31:53 +08:00
mouxin 0a92e96f20 [Feature] Add Golang-based Router for Request Scheduling and Load Balancing (#5882)
* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] add golang router

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

* [Feature] Add Golang-based Router for Request Scheduling and Load Balancing

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-01-07 21:28:08 +08:00
chenjian 925e7edd3c [Bug fix] Limit multi-modal request to 1 (#5901) 2026-01-07 20:25:07 +08:00
lizhenyun01 2be8656c29 [BugFix] fix mtp split kv attention (#5920)
* [BugFix] fix mtp split kv attention

* clean code

* clean code
2026-01-07 04:07:31 -08:00
chenjian c883a2d3ec [Optimization] Reduce preemption occurrence when blocks not enough (#5696)
* [Optimize] Reduce preemption occurrence when blocks not enough for decoding

* fix

* fix

* fix spell

* optimize performance

* fix
2026-01-07 20:01:16 +08:00
Ryan 3e74bacc5e add m_grouped_gemm_fp8_fp8_bf16_nt_contiguous_custom_python_op (#5847) 2026-01-07 16:17:55 +08:00
kevin eabd01cd21 [BugFix] fix eb5 prefix bug (#5879)
* fix eb5 prefix bug

* update ci test

* update code

* update code

* update code

* update code

* update code

* update code

* update code
2026-01-06 23:50:39 -08:00
kevin a76e8ae40c [Feature] support rdma pd dy-c8 (#5788)
* add rdma pd dy-c8

* update code
2026-01-07 14:55:25 +08:00
sunxin 6ee8241521 [V1 Loader] Support loading static C8 scale JSON (#5909)
* v1 loader: support loading static C8 scale JSON

* update
2026-01-06 19:49:30 -08:00
fmiao2372 1ee285c2d6 [Intel HPU] enable chunked prefill (#5903)
* [Intel HPU] enable chunked prefill

* fix bug by copilot comments
2026-01-06 21:01:50 +08:00
周周周 83ae59431e [BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp (#5895) 2026-01-06 15:39:06 +08:00
ddchenhao66 733014bf32 [XPU] Support EP4TP1 in pd disaggregation (#5860)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-06 15:25:36 +08:00
gaoziyuan e99ec4c9d5 [Bugfix]fix model weight signal tensor num (#5900) 2026-01-06 14:36:59 +08:00
Yonghua Li 9445fbe054 [KVCache] launch cache transfer processes only if hierarchical cache or kv cache storage is enabled (#5871)
* [fix] temporarily forbid cpu cache in update/clear api

* [fix] stop launching cache transfer manager unless hierarchical cache is enabled

* [fix] fix no attr hierarchical cache

* [fix] fix ci

* [fix] fix test_prefix_cache_manager.py
2026-01-06 14:27:47 +08:00
Yonghua Li 9fc2400e71 [BugFix] fix mtp cache attaching for pd disaggregation (#5884)
* [fix] fix mtp cache attaching for pd disaggregation

* [fix] fix test_mtp_proposer.py
2026-01-06 14:17:53 +08:00
jc e9b25aa72f [BugFix] Storage backend gets env params (#5892)
* Storage backend gets env params

* up

* up

* up
2026-01-06 14:14:17 +08:00
lizexu123 acdf0cd1d9 fix hadamard_block_size (#5888) 2026-01-06 14:12:14 +08:00
qwes5s5 b3ca7f041a [BugFix] Fix redundant prompt_logprobs in the second chunk of streaming response when return_token_ids is enabled for v1/completions and fix trace file name (#5829)
* fix prompt logprobs bug

* fix trace file name

---------

Co-authored-by: qwes5s5 <root@yq01-sys-rpm26xc1knu.yq01.baidu.com>
2026-01-06 14:11:43 +08:00
freeliuzc ca574119e5 support multi-step draft-model with cudagraph (#5886) 2026-01-06 11:16:21 +08:00
Neil Zhu 272a371635 [Metax] optimize flash attention backend (#5876) 2026-01-06 09:52:09 +08:00
lizexu123 1d3ae7c024 [BugFix] fix w4afp8 tp=8 (#5868)
* fix w4afp8 tp=8

* fix
2026-01-05 18:59:02 +08:00
tianhaodongbd 6f14b180e3 [RL] Change 'model' to the instance variable 'tmp_model' (#5872) 2026-01-05 02:09:02 -08:00
ming1753 f50e1bcc16 [Others] enable use PFCC deep_ep (#5822)
* upstream deep_ep

* fix bug

* fix bug

* modify env name
2026-01-05 02:07:01 -08:00
jc 8d384f9fd8 [PD Disaggregation] Update usage of pd disaggregation and data parallel (#5742)
* Update usage of pd disaggregation

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up dp docs

* up

* up

* up

* fix unittest
2026-01-05 17:51:29 +08:00
cmcamdy 690d4bcdb0 [XPU] Speculative Decoding with PD (#5856)
* [XPU] Speculative Decoding with PD

* fix post process

* share kv cache sender

* support speculate decoding step system cache

* support speculate decoding step system cache

---------

Co-authored-by: root <root@gajl-bbc-onlinec-com-1512108.gajl.baidu.com>
2026-01-05 17:31:03 +08:00
周周周 dc13344ab8 [Optimization] add del to decrease peak memory in MoE prefill (#5863) 2026-01-05 14:01:48 +08:00
jc e911ac2ce7 [BugFix] Refine the preparation of cpu and storage cache (#5777)
* Refine the preparation of cpu and storage cache

* fix error

* fix error

* up

* fix

* up docs

* fix unittest

* remove debug info
2026-01-05 10:13:30 +08:00
jc 95257c1dbd [Feature] RDMACommunicator send key and value scale (#5737)
* RDMACommunicator send key and value scale
---------

Co-authored-by: kevin <chengyf112@gmail.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-05 10:04:24 +08:00
Copilot 7d5282e158 [APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT (#5865)
* Initial plan

* Add configurable FD_WORKER_ALIVE_TIMEOUT environment variable

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Add test for FD_WORKER_ALIVE_TIMEOUT environment variable

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Update docs/zh/usage/environment_variables.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/usage/environment_variables.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Improve test coverage to validate integration with check_health calls

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Remove test_worker_alive_timeout.py per reviewer feedback

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-05 09:47:12 +08:00
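The commit above adds a worker health-check timeout configurable through the FD_WORKER_ALIVE_TIMEOUT environment variable. A minimal sketch of how such a variable might be read, assuming a numeric timeout in seconds; the default value (10.0) and the helper name are illustrative, not taken from the FastDeploy source:

```python
import os

def get_worker_alive_timeout(default: float = 10.0) -> float:
    """Read the worker health-check timeout from the environment.

    Falls back to `default` when the variable is unset or not a
    valid number (hypothetical behavior, for illustration only).
    """
    raw = os.environ.get("FD_WORKER_ALIVE_TIMEOUT")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default

# Example: override the timeout to 30 seconds for this process.
os.environ["FD_WORKER_ALIVE_TIMEOUT"] = "30"
print(get_worker_alive_timeout())
```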
Yonghua Li 5e4e6692a4 [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 (#5840)
* [BugFix] fix cache manager not launched in case of mtp or blockwise fp8

* [fix] fix mtp cache in mtp.py

* [fix] fix gpu ops import

* [fix] fix mtp layer idx

* [fix] fix xpu model runner mtp cache

* [fix] fix mtp import
2026-01-04 04:35:37 -08:00
kevin 52dc9a7b85 [BugFix] skip mm revert (#5848)
* skip mm revert

* update code

* update test
2026-01-04 14:25:45 +08:00
MingkunZhang f732d7d2ad [Metax] adapt prefix caching & cpu swap (#5844)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2025-12-31 17:02:48 +08:00
chen 193886e745 only cuda run triton op (#5846) 2025-12-31 14:17:31 +08:00
GoldPancake 4e10ae5d99 [Speculative Decoding] Optimize draft logprob (#5842)
* optimize draft logprob

* fix ut
2025-12-31 13:35:56 +08:00
ddchenhao66 9e45ef7ca9 [XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL (#5831) 2025-12-31 09:49:12 +08:00
kevin 74e162697f eb5 mm skip prefix cache (#5838) 2025-12-30 05:30:48 -08:00
chen 0bcf924e10 [Optimization] Optimization for gather_logprob by 10GB (#5817)
* opt logprobs gather_logprob, reduce device memory usage by 10GB when token_num=8k
2025-12-30 15:33:34 +08:00
lizexu123 44a13e4557 [Feature] support w4afp8 v1_loader and v0_loader(tp>1) (#5757)
* support

* fix

* support w4afp8 v1_loader and v0_loader

* fix

* fix test

* fix test

* fix test

* fix moe.py

* add test_ernie_4_5_w4afp8

* add test

* delete tensor

* fix test

* fix

* add

* fix test
2025-12-30 14:11:52 +08:00
GoldPancake e78e22ebd5 [BugFix] Fix entropy bugs (#5818)
* fix entropy bugs

* fix ut

* fix
2025-12-29 20:44:29 -08:00
tianhaodongbd edb9647422 [RL] add lm_head_fp32 in RolloutModelConfig (#5825) 2025-12-29 20:22:30 -08:00
周周周 7ae13b2326 [PD Disaggregation] remove unused para in RDMACommManager (#5814) 2025-12-30 11:38:30 +08:00
CSWYF3634076 deb9698ac5 remove invalid elif branch (#5821) 2025-12-29 19:21:28 +08:00
CSWYF3634076 9286403570 [Models] Add Qwen3-VL Model Support (#5763)
* support v1 loader

* remove useless code

* remove useless

* [Model] support Qwen3VL images success

* [Model] support Qwen3VL rope_3d

* [Model] support Qwen3VL remove log

* [Model] support Qwen3VL RL

* [Model] support Qwen3VL tp

* [Model] support Qwen3VL video

* [Model] support Qwen3VL fix ernievl

* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds

* [Model] support Qwen3VL fix multi card

* [Model] support Qwen3VL file close

* [Model] support Qwen3VL fix ce

* [Model] support Qwen3VL fix unittest

* [Model] support Qwen3VL add unittest

---------

Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
Ryan eb782a0225 [BugFix] Fix return value inconsistency for ep_moe_expert_combine op (#5812) 2025-12-29 16:44:00 +08:00
ddchenhao66 56a9ecccb2 [XPU] xpu support ep4tp4 (#5773)
* [XPU] xpu support ep4tp4

* Add commands to check multiprocessing and fastdeploy processes

---------

Co-authored-by: ddchenhao66 <dhaochen163.com>
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2025-12-29 11:27:01 +08:00
chenjian 91a2b13676 [BugFix] Fix preemption out of real_bsz (#5805) 2025-12-29 09:52:36 +08:00