周周周
cbdb2462ea
cp 1131 tbo to develop ( #6281 )
2026-02-03 15:23:23 +08:00
xiaozude
030647521a
[Metax] adapt to the latest develop ( #6282 )
2026-01-29 23:21:20 -08:00
MingkunZhang
c4abb01f9c
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) ( #6266 )
2026-01-29 19:24:36 +08:00
Ryan
5e78c1ac87
[Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph splitting mode ( #6196 )
...
* refine comment && refine variable name
* replace comment
2026-01-29 16:29:54 +08:00
GoldPancake
7d6c87c29e
[Others] Support constrained decoding when enable_thinking is false ( #6248 )
...
* support constrained decoding when enable_thinking is false
* fix
* fix
* fix
2026-01-28 00:05:17 -08:00
sunxin
27f8799f04
[Model Runner] Refactor execute_model for GPU async scheduling ( #6176 )
2026-01-28 14:19:33 +08:00
freeliuzc
ce06c6dfb3
[BugFix] Fix token_penalty kernel ( #6069 )
...
* fix token_penalty kernel
* try to fix xpu
* fix xpu
* fix unit test
2026-01-28 12:03:05 +08:00
jc
b1698a79cb
[RL] add version to the key of cache storage && refine raising error ( #6160 )
...
* Waiting for cache transfer manager inited
* up
* up
* up
* up
* up
* fix according to comments
* fix unittest
* fix
* fix unittest
* fix error
* pass storage_backend to worker
2026-01-27 10:47:46 +08:00
CSWYF3634076
08c411518f
[Loader] support dummy load weight ( #6169 )
...
* [Loader] support dummy load weight
* [Loader] support dummy load weight v2
* [Loader] support dummy load weight unittest
* [Loader] support dummy load weight unittest v2
* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin
adc69c15d0
[Model Runner] Prepare token count and move FA3 initialization into the graph ( #6170 )
...
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周
0966df78dc
[Others] remove stop_nums ( #6182 )
2026-01-26 12:12:47 +08:00
Yonghua Li
833d00e2d7
[BugFix] move cache creation back to cache transfer process and adapt clear/update ( #6144 )
...
* [fix] move cache creation back to cache transfer process
* [fix] fix clear cache
* [chore] change some log level
* [fix] fix clear cache
* [fix] fix clear cache for blockwisefp8 and mtp
* [fix] fix c8
* [fix] fix clear_mtp_cache args
* [chore] update cache_transfer_manager
* [fix] fix update mtp cache
2026-01-24 21:59:13 +08:00
sunxin
bef6293552
[Model Runner] Add exist_prefill_flag ( #6172 )
2026-01-23 13:07:05 +08:00
wangyifei
b7c5daa316
[RL] add pause, update_weights, resume interface for async RL ( #6052 )
...
* support dynamic run_control_request through zmq from apiserver to common_engine
* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method
* change /is_paused from HTTP POST method to GET method
* add pause, resume, is_paused implementation
* support engine <==> worker communication(request&response)
* support sync weights through RDMA from checkpoint_transfer
* support specified version, rsync_config in update_weights rpc call
* add pause, update_weights, resume interface for async RL
* bug fix: update_weights support using default arguments
* fix typo
* typo fix
* typo fix
* typo fix
* add unit tests for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all
* add "rsync" to LoadConfig.load_strategy Literal type hints
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* typo fix
* typo fix
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* check version/rsync params
* add error log when version.txt does not exist
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* raise a specific ValueError when the parameter check fails
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* tp barrier after run_control_method
* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue
* typo fix
* typo fix
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-23 10:18:07 +08:00
yinwei
3cd0ffe36c
Enable CudaGraph
2026-01-22 19:49:33 +08:00
yinwei
1e3c35496c
[XPU][Graph Optimization] XPU Support CUDAGraph ( #6152 )
...
* support cuda graph
2026-01-22 14:41:56 +08:00
Haonan Luo
82057cb71f
Support MXFP4 for GPT-OSS ( #5435 )
...
* support mxfp4 in gpt-oss
* support mxfp4 in gpt-oss
* add scope for flashinfer
* remove torch code
* update envs.FD_MXFP4_BACKEND
* update process_weights_after_loading
* update env name
* support tp in gpt-oss, add e2e test
* add flashinfer-python-paddle in requirements
* fix import error
* add test
* add test
* add test
* add test
2026-01-22 14:21:01 +08:00
zccjjj
14a64e9b3b
[XPU] change XPU EP interface from xDeepEP to paddle ( #5706 )
...
* add ENV VAR to control low latency buffer
2026-01-21 18:23:45 +08:00
yinwei
85d995100a
Update Dummy Run To Support Multi-Batch Execution ( #6123 )
2026-01-21 14:20:44 +08:00
Ryan
dda27e50f5
[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph ( #6081 )
...
* rm static_op_get_block_shape_and_split_kv_block from cudagraph
* update max_capture_shape
* fallback: zeros -> empty to avoid coverage check
* check graph_opt_config exists
* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test
* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
zhupengyang
45ebb2efb4
[XPU] support plugin model ( #6092 )
2026-01-20 13:00:09 +08:00
jackyYang6
988e0bc338
[Feature] Add PaddleFormers fallback backend ( #5999 )
...
* feat(paddleformers): add dense text model fallback backend
* docs(paddleformers): add user guide and fix code review issues
* add fallback unit test
* precommit format
* fix pre-commit
* fix: address code review feedback
* docs: add PaddleFormers backend documentation (EN) and simplify installation
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-19 21:50:50 +08:00
GoldPancake
05fbd89a8e
[Speculative Decoding][Bugfix] Fix MTP logprob issues caused by max_num_logprobs ( #6084 )
2026-01-19 14:55:36 +08:00
ddchenhao66
3685474799
[XPU] xpu support mm prefill batch ( #6072 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-19 14:36:35 +08:00
GoldPancake
b917b56aca
[Bugfix] Fix logprob issues caused by max_num_logprobs ( #6067 )
2026-01-16 04:40:18 -08:00
周周周
97f96e34ca
only update self.exist_prefill_task_signal in v0 ( #6064 )
...
* commit
* commit
* commit
---------
Co-authored-by: xiaoluomi <1037819816@qq.com>
2026-01-16 20:11:55 +08:00
GoldPancake
bda38aa519
[Speculative Decoding] Support MTP for GLM-4.5-Air ( #6047 )
...
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
guozhuangzhuang
d2f1ec2b1b
[XPU] fix(xpu_model_runner): reset seq_lens_encoder to 0 for decode role in PD splitwise mode ( #6048 )
...
* fix(xpu_model_runner): reset seq_lens_encoder to 0 for decode role in PD splitwise mode
- Set seq_lens_encoder to 0 when splitwise_role is 'decode' during prefill processing
- This ensures proper continuation of decoding after P generate first token in PD disaggregated architecture
- Fixes potential sequence length inconsistency in PD splitwise deployment scenarios
* format
2026-01-15 20:24:56 +08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
cmcamdy
59d8ae0a25
[XPU] Speculate Decoding + PD, benchmark fix ( #6036 )
...
* fix mtp pd
* fix kernel
* fix code style
* fix kernel
* fix test / clear debug code
* fix test / clear debug code
* fix codestyle
* fix codestyle
* fix codestyle
2026-01-15 19:19:03 +08:00
Cheng Yanfei
fbcccaa750
[Intel HPU] enable MoE EP for hpu ( #5855 )
...
* enable HPU MoE EP
* MoE intermediate_scale stack
* enable loader_v1 esp for tensor_wise_fp8 TP or EP
* modify activation_scale name
2026-01-15 13:08:00 +08:00
ming1753
7c56041272
[BugFix] fix PaddleOCR-VL illegal memory ( #6042 )
2026-01-14 20:07:43 -08:00
RAM
b3f59fd9b5
[RL][CI] Support Async R3 And Add Accuracy Test ( #5937 )
...
* add bs1 r3 test case
* async put
* r3 test case 1.0
* success run eb5
* refine test case
* pre-commit
* add eb45 & glm testcase
* format code
* add p2pstore requirements
* support only last turn
* R3 use worker log
* refine code &fix ci bug
* refine error message
* fix empty input bug
* Success set acc ci of eb45 and glm45
* refine code
* fix bug
2026-01-14 04:25:06 -08:00
luukunn
93b7675a64
[Feature]Report FD statistical information ( #5646 )
...
* add usage commit
* update envs and xpu
* add requirements
* fix quantization value
* add unit test
* add unit test
* fix unit test
* add unit test
* add unit test
* add unit test
* add unit test
* add unit test
* add unit test
* fix FD_USAGE_STATS_SERVER
* fix
* fix
* add doc
* add doc
* add doc
* add doc
* add doc
* fix file name
2026-01-14 17:54:01 +08:00
MingkunZhang
273e79aa5b
[Metax][Fix] fix self.share_inputs['preempted_idx']=[] incorrect use ( #6038 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-14 17:06:00 +08:00
chenjian
74d0f1c01f
[Optim] Robust sync status when preempted happens ( #5796 )
...
* [Bug fix] Sync status for caching output cache
* fix
* fix
* fix bug
* fix
* fix
* support xpu
* fix
* fix
* fix
* fix
* fix
* fix ci
* fix ci
* fix xpu
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
Yonghua Li
456637002d
[BugFix] fix cache transfer manager updating/clearing ( #5930 )
...
* [fix] fix cache transfer manager updating/clearing
* [fix] fix code style
* [fix] fix config
* [fix] fix engine client
* [fix] let worker update kv cache status signal
* [fix] update worker process
* [fix] fix clear/update for case if comm group is shutdown
* [fix] update dynamic weight manager
* [fix] fix port
* [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting
2026-01-13 05:09:29 -08:00
GoldPancake
eb8ce36ae9
[BugFix] Fix entropy calculation issue in TP ( #5997 )
...
* fix entropy bugs
2026-01-13 11:10:46 +08:00
lzy
223b2f5d86
Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase ( #5917 )
2026-01-12 14:09:39 +08:00
zhupengyang
9db48ecb34
[XPU] fix dp4 ( #5946 )
2026-01-09 20:36:53 +08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
GoldPancake
3ca99ab170
[Speculative Decoding] Return accepted tokens per head in response ( #5947 )
...
* adjust log level
* add accepted tokens per head
* fix ut
* fix
2026-01-09 14:32:08 +08:00
GoldPancake
e41d434548
[Bugfix] Fix entropy calculation bugs ( #5941 )
...
* fix entropy bugs
2026-01-08 20:57:35 +08:00
Copilot
6825903559
[BugFix] Fix misleading logging in worker_process for request counting ( #5939 )
...
* Initial plan
* Optimize logging in worker_process to accurately reflect request types
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
* Address feedback: rename to max_occupied_batch_index and simplify logging
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
* Improve comment clarity for batch request counting
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
* Fix code style: reorder imports with isort
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 16:36:22 +08:00
fmiao2372
1ee285c2d6
[Intel HPU] enable chunked prefill ( #5903 )
...
* [Intel HPU] enable chunked prefill
* fix bug by copilot comments
2026-01-06 21:01:50 +08:00
ddchenhao66
733014bf32
[XPU] Support EP4TP1 in pd disaggregation ( #5860 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-06 15:25:36 +08:00
gaoziyuan
e99ec4c9d5
[Bugfix]fix model weight signal tensor num ( #5900 )
2026-01-06 14:36:59 +08:00
freeliuzc
ca574119e5
support multi-step draft-model with cudagraph ( #5886 )
2026-01-06 11:16:21 +08:00
cmcamdy
690d4bcdb0
[XPU] Speculative Decoding with PD ( #5856 )
...
* [XPU] Speculative Decoding with PD
* fix post process
* share kv cache sender
* support speculate decoding step system cache
* support speculate decoding step system cache
---------
Co-authored-by: root <root@gajl-bbc-onlinec-com-1512108.gajl.baidu.com>
2026-01-05 17:31:03 +08:00
Yonghua Li
5e4e6692a4
[BugFix] fix cache manager not launched in case of mtp or blockwise fp8 ( #5840 )
...
* [BugFix] fix cache manager not launched in case of mtp or blockwise fp8
* [fix] fix mtp cache in mtp.py
* [fix] fix gpu ops import
* [fix] fix mtp layer idx
* [fix] fix xpu model runner mtp cache
* [fix] fix mtp import
2026-01-04 04:35:37 -08:00