FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
sunxin	7a78001be2	fix execute_model_normal in empty run (#6968 )	2026-03-23 14:07:46 +08:00
周周周	1c38da2118	Make seq_lens_this_time/decoder/encoder equal shape (#6942 )	2026-03-20 15:31:52 +08:00
qwes5s5	3b7507a4c2	test_abort (#6743 )	2026-03-17 14:06:40 +08:00
huicongyao	eab429d05e	fix performance drop while no spec (#6866 )	2026-03-17 13:06:36 +08:00
gongweibao	a6351dea0b	[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533 ) * init * init * fix format * add * add files * add ut * fix some * add ut * add more * add * fix pre-commit * fix pre-commit * fix cover * skip long seq * add * add * fix * remove not need * fix set attr * fix comments * fix comments * fix failed tests --------- Co-authored-by: gongweibao <gognweibao@baidu.com>	2026-03-16 21:32:43 +08:00
ming1753	bb925c605f	[Other] Adjust GPUModelRunner to enhance compatibility (#6851 )	2026-03-16 14:49:19 +08:00
huicongyao	2e63d88f7a	[Optimization][Speculative Decoding]Fuse padding sampling params (#6765 ) * optimize speculate pre process unit test * Add CUDA kernel for building sampling params in speculative decoding * init infer seed in device * format code * add unittest & fix * fix * format-code * format-code * fix rebase * . * fix unitest	2026-03-12 05:05:15 -07:00
RAM	cdaf6dd400	[RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599 ) * cherry-pick Support Fully Async and PrefixCache step 1 * copy routing_indices_cache.py from 2.4 * cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388) * cherry-pick [RL] Clear Requests status of R3 (#6569) * delete code * fix rename bug * fix status shape bug * fix ci	2026-03-12 01:13:30 -07:00
Yonghua Li	7811eeccaa	[fix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758 )	2026-03-11 15:02:32 +08:00
freeliuzc	cf7934a4b2	[Speculative Decoding] Unify Spec and non-spec branch (#6685 ) * optimize spec-inference architecture * delete debug log * optimize spec_method usage && fix unit_test * add claude unit-test skill * fix some ugly bug * enhance robustness and bounds check * unify method & spec_method to method to avoid bug * activate CI * fix unit test * Unify logprobs computation for naive and speculative decoding, fix CUDA kernel * fix logprob bug && optimize verify kernel * fix exist_decode() judge	2026-03-10 23:58:44 -07:00
sunxin	812657beee	fix pd overlap (#6753 )	2026-03-10 20:29:54 +08:00
AIbin	c3aceb6bdc	[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689 ) * Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM	2026-03-10 15:05:14 +08:00
sunxin	28f7727a3d	[Feature] Set overlap schedule as default (#6668 ) * overlap default	2026-03-09 22:34:54 +08:00
jc	b0fd242add	[BugFix] Fix error in dynamic c8 cache (#6544 ) * [BugFix] Fix error in dynamic c8 cache * fix device id	2026-03-06 10:11:23 +08:00
sunxin	0dc7034ce0	[Model Runner] Deprecate not_need_stop (#6356 ) * Deprecate not_need_stop	2026-03-05 10:55:42 +08:00
sunxin	aee97e3aae	fix exist_prefill_flag when preempted task (#6629 )	2026-03-04 11:11:40 +08:00
huicongyao	0f718baaf2	[Speculative Decoding]Reformat input preprocess for spec decode (#6501 ) * add speculate_pre_process kernel * reduce one slice * make d2h async && fix mtp bug for new pre_process * fix * add unitest * fix: code stype formatting * fix * fix: thread race in speculate_preprocess && rename d2h event	2026-03-03 10:22:07 +08:00
ming1753	97eee75677	[Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407 ) * Optim GPU Mem Usage --------- Co-authored-by: huzesen <huzesen@baidu.com>	2026-02-28 15:07:43 +08:00
cmcamdy	13447279aa	[XPU] Fix PD + MTP (#6495 ) * fix pd + mtp * fix code style * fix PD + MTP, D get P's first token * add anno for gpu(speculate_update) * update draft insertv1 * fix wapper & kernel * fix wapper * fix code stype	2026-02-27 19:07:35 +08:00
gongweibao	edd31e8849	[Feature] Add Deterministic Inference Support (#6476 ) * add * [tests] Add Paddle attention determinism tests and refactor resource manager Add comprehensive determinism tests for Paddle attention layer and refactor resource manager for deterministic mode support. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * add * add * add * add * add more * add more * fixsome * fixsome * fix bugs * fix bugs * only in gpu * add docs * fix comments * fix some * fix some * fix comments * add more * fix potential problem * remove not need * remove not need * remove no need * fix bug * fix bugs * fix comments * fix comments * Update tests/ce/deterministic/test_determinism_verification.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update tests/inter_communicator/test_ipc_signal.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update tests/layers/test_paddle_attention_determinism.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update tests/engine/test_sampling_params_determinism.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update tests/layers/test_paddle_attention_determinism.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update tests/layers/test_paddle_attention_determinism_standalone.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix comments * fix import error * fix a bug * fix bugs * fix bugs * fix coverage * refine codes * refine code * fix comments * fix comments * fix comments * rm not need * fix allreduce large tensor bug * mv log files * mv log files * add files --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-02-26 19:31:51 -08:00
GoldPancake	2178f2829b	[Speculative Decoding] Support suffix decoding (#6403 ) * support suffix decoding	2026-02-26 11:42:05 +08:00
Yuanle Liu	6d3fede240	[OP][Feature] 统一 limit_thinking_content_length CUDA 算子，支持回复长度限制与注入序列 (#6493 ) * Initial plan * Migrate PRs #6311, #6129, #6305 to develop and merge unit tests Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix * update * fix * fix ci * fix ci * Initial plan * test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add disable-thinking case to test_chat_with_response_max_tokens Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add both reasoning_max_tokens and response_max_tokens case Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix ci * fix ci * fix ci --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>	2026-02-25 21:36:50 +08:00
Yonghua Li	e2332a1112	[BugFix] fix num_cpu_blocks computation (#6438 ) * [BugFix] fix num_cpu_blocks computation * [fix] fix syntax and log * [fix] pre-commit * [fix] use getattr * [fix] ci test	2026-02-13 11:05:14 +08:00
yzwu	60e75ea8e8	[Iluvatar][CI] Fix cannot import get_stop (#6165 )	2026-02-10 16:57:23 +08:00
kevin	3ce842b55b	[BugFix] add reset shared inputs when update weight dummy run (#6331 ) * fix dummy run input bug * update code * update code * update code * update code	2026-02-10 10:29:03 +08:00
kevin	d60daca4a8	[Feature] consider multimodal model when dummy run (#6045 ) * add mm do profile * updata code * update code * update code * update code * update test case * update code * update code * fix xpu bug * update code * add mm do profile * update test case * update code	2026-02-09 17:49:55 +08:00
sunxin	783d56e28a	[Optimization] Support logprob async copy (#6362 ) * support logprob async copy * fix prompt logprob * fix xpu	2026-02-09 17:32:12 +08:00
周周周	2b4748de4f	[MTP] refactor MTP pre_process (#6358 )	2026-02-09 10:47:15 +08:00
GoldPancake	183b8d325a	[RL] Support GLM MTP RL Model (#6267 )	2026-02-04 20:14:35 +08:00
sunxin	9b0a82cfa9	[Model Runner] Support overlap schedule (#6259 )	2026-02-04 10:49:44 +08:00
bukejiyu	12d4b4cb87	[Feature]Support reorder ids to split prefill and decodes (#5779 ) * support reorder ids * perfect code * fix * fix unittest * delete code * fix * add python api * delete custom op * update algorithm * fix swap * support condense * support condense * support mtp * delete code * update * update * update * update * update for other platfrom * update * fix * fix mtp * fix ut * update * fix ut * update ut * fix * fix encoder_cache * fix ci * fix * fix vl * Fix performance regression * fix * fix * fix mtp * fix index->req_id mapping * fix ut --------- Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com> Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-02-03 00:28:02 -08:00
Ryan	5e78c1ac87	[Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph spliting mode (#6196 ) * refine comment && refine variable name * replace comment	2026-01-29 16:29:54 +08:00
GoldPancake	7d6c87c29e	[Others] Support constrained decoding when enable_thinking is false (#6248 ) * support constrained decoding when enable_thinking is false * fix * fix * fix	2026-01-28 00:05:17 -08:00
sunxin	27f8799f04	[Model Runner] Refactor execute_model for GPU async scheduling (#6176 )	2026-01-28 14:19:33 +08:00
freeliuzc	ce06c6dfb3	[BugFix] Fix token_penalty kernel (#6069 ) * fix token_penalty kernel * try to fix xpu * fix xpu * fix unit test	2026-01-28 12:03:05 +08:00
sunxin	adc69c15d0	[Model Runner] Prepare token count and move FA3 initialization into the graph (#6170 ) * prepare for token num and put FA3 init in graph	2026-01-26 12:16:57 +08:00
周周周	0966df78dc	[Others] remove stop_nums (#6182 )	2026-01-26 12:12:47 +08:00
Yonghua Li	833d00e2d7	[BugFix] move cache creation back to cache transfer process and adapt clear/update (#6144 ) * [fix] move cache creation back to cache transfer process * [fix] fix clear cache * [chore] change some log level * [fix] fix clear cache * [fix] fix clear cache for blockwisefp8 and mtp * [fix] fix c8 * [fix] fix clear_mtp_cache args * [chore] update cache_transfer_manager * [fix] fix update mtp cache	2026-01-24 21:59:13 +08:00
sunxin	bef6293552	[Model Runner] Add exist_prefill_flag (#6172 )	2026-01-23 13:07:05 +08:00
wangyifei	b7c5daa316	[RL] add pause, update_weights, resume interface for async RL (#6052 ) * support dynamic run_control_request through zmq from apiserver to common_engine * support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method * change /is_puased from HTTP POST method to GET method * add pause、resume、is_paused implementation * support engine <==> worker communication(request&response) * support sync weights through RDMA from checkpoint_transfer * support specified version, rsync_config in update_weights rpc call * add pause, update_weights, resume interface for async RL * bug fix: update_weights support using default arguments * fix typo * typo fix * typo fix * typo fix * add unitest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all * add "rsync" to LoadConfig.load_strategy Literal type hints Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * typo fix * typo fix * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * check version/rsync params * add error log when version.txt not exists Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * raise specified ValueError when paramters check failed Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * tp barrier after run_control_method * encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue * typo fix * typo fix --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-23 10:18:07 +08:00
Ryan	dda27e50f5	[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph (#6081 ) * rm static_op_get_block_shape_and_split_kv_block from cudagraph * update max_capture_shape * fallback: zeros -> empty to avoid coverage check * check graph_opt_config exists * add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test * add use_cudagraph flag to control step_use_cudagraph	2026-01-20 14:05:18 +08:00
jackyYang6	988e0bc338	[Feature] Add PaddleFormers fallback backend (#5999 ) * feat(paddleformers): add dense text model fallback backend * docs(paddleformers): add user guide and fix code review issues * add fallback unit test * precommit format * fix pre-commit * fix: address code review feedback * docs: add PaddleFormers backend documentation (EN) and simplify installation --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-19 21:50:50 +08:00
GoldPancake	05fbd89a8e	[Speculative Decoding][Bugfix] Fix MTP logprob issues caused by max_num_logprobs (#6084 )	2026-01-19 14:55:36 +08:00
GoldPancake	b917b56aca	[Bugfix] Fix logprob issues caused by max_num_logprobs (#6067 )	2026-01-16 04:40:18 -08:00
GoldPancake	bda38aa519	[Speculative Decoding] Support MTP for GLM-4.5-Air (#6047 ) * glm mtp * add spec neox partial rope	2026-01-16 14:35:24 +08:00
freeliuzc	49617d9832	[Feature]Support tag phase token enforce generation (#6034 ) * support tag phase token enforce generation * optimize note and some feature * fix sampler unit test --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-15 03:59:55 -08:00
RAM	b3f59fd9b5	[RL][CI] Support Async R3 And Add Accuracy Test (#5937 ) * add bs1 r3 test case * async put * r3 test case 1.0 * success run eb5 * refine test case * pre-commit * add eb45 & glm testcase * format code * add p2pstore requirements * support only last turn * R3 use worker log * refine code &fix ci bug * refine error mesg * fix empty input bug * Success set acc ci of eb45 and glm45 * refine code * fix bug	2026-01-14 04:25:06 -08:00
chenjian	74d0f1c01f	[Optim] Robust sync status when preempted happens (#5796 ) * [Bug fix] Sync status for caching output cache * fix * fix * fix bug * fix * fix * support xpu * fix * fix * fix * fix * fix * fix ci * fix ci * fix xpu --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-14 12:07:33 +08:00
GoldPancake	eb8ce36ae9	[BugFix] Fix entropy calculation issue in TP (#5997 ) * fix entropy bugs	2026-01-13 11:10:46 +08:00
GoldPancake	e41d434548	[Bugfix] Fix entropy calculation bugs (#5941 ) * fix entropy bugs	2026-01-08 20:57:35 +08:00

1 2 3 4 5

218 Commits