396 Commits

Author SHA1 Message Date
周周周 cbdb2462ea cp 1131 tbo to develop (#6281) 2026-02-03 15:23:23 +08:00
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
MingkunZhang c4abb01f9c [Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) (#6266) 2026-01-29 19:24:36 +08:00
Ryan 5e78c1ac87 [Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph splitting mode (#6196)
* refine comment && refine variable name

* replace comment
2026-01-29 16:29:54 +08:00
GoldPancake 7d6c87c29e [Others] Support constrained decoding when enable_thinking is false (#6248)
* support constrained decoding when enable_thinking is false

* fix

* fix

* fix
2026-01-28 00:05:17 -08:00
sunxin 27f8799f04 [Model Runner] Refactor execute_model for GPU async scheduling (#6176) 2026-01-28 14:19:33 +08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
jc b1698a79cb [RL] add version to the key of cache storage && refine raising error (#6160)
* Waiting for cache transfer manager inited

* up

* up

* up

* up

* up

* fix according to comments

* fix unittest

* fix

* fix unittest

* fix error

* pass storage_backend to worker
2026-01-27 10:47:46 +08:00
CSWYF3634076 08c411518f [Loader] support dummy load weight (#6169)
* [Loader] support dummy load weight

* [Loader] support dummy load weight v2

* [Loader] support dummy load weight unittest

* [Loader] support dummy load weight unittest v2

* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
Yonghua Li 833d00e2d7 [BugFix] move cache creation back to cache transfer process and adapt clear/update (#6144)
* [fix] move cache creation back to cache transfer process

* [fix] fix clear cache

* [chore] change some log level

* [fix] fix clear cache

* [fix] fix clear cache for blockwisefp8 and mtp

* [fix] fix c8

* [fix] fix clear_mtp_cache args

* [chore] update cache_transfer_manager

* [fix] fix update mtp cache
2026-01-24 21:59:13 +08:00
sunxin bef6293552 [Model Runner] Add exist_prefill_flag (#6172) 2026-01-23 13:07:05 +08:00
wangyifei b7c5daa316 [RL] add pause, update_weights, resume interface for async RL (#6052)
* support dynamic run_control_request through zmq from apiserver to common_engine

* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method

* change /is_paused from HTTP POST method to GET method

* add pause, resume, is_paused implementation

* support engine <==> worker communication(request&response)

* support sync weights through RDMA from checkpoint_transfer

* support specified version, rsync_config in update_weights rpc call

* add pause, update_weights, resume interface for async RL

* bug fix: update_weights support using default arguments

* fix typo

* typo fix

* typo fix

* typo fix

* add unit tests for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all

* add "rsync" to LoadConfig.load_strategy Literal type hints

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* typo fix

* typo fix

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* check version/rsync params

* add error log when version.txt does not exist

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* raise specified ValueError when parameter check fails

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* tp barrier after run_control_method

* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue

* typo fix

* typo fix

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-23 10:18:07 +08:00
yinwei 3cd0ffe36c Enable CudaGraph 2026-01-22 19:49:33 +08:00
yinwei 1e3c35496c [XPU][Graph Optimization] XPU Support CUDAGraph (#6152)
* support cuda graph
2026-01-22 14:41:56 +08:00
Haonan Luo 82057cb71f Support MXFP4 for GPT-OSS (#5435)
* support mxfp4 in gpt-oss

* support mxfp4 in gpt-oss

* add scope for flashinfer

* remove torch code

* update envs.FD_MXFP4_BACKEND

* update process_weights_after_loading

* update env name

* support tp in gpt-oss, add e2e test

* add flashinfer-python-paddle in requirements

* fix import error

* add test

* add test

* add test

* add test
2026-01-22 14:21:01 +08:00
zccjjj 14a64e9b3b [XPU] change XPU EP interface from xDeepEP to paddle (#5706)
* add ENV VAR to control low latency buffer
2026-01-21 18:23:45 +08:00
yinwei 85d995100a Update Dummy Run To Support Multi-Batch Execution (#6123) 2026-01-21 14:20:44 +08:00
Ryan dda27e50f5 [Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph (#6081)
* rm static_op_get_block_shape_and_split_kv_block from cudagraph

* update max_capture_shape

* fallback: zeros -> empty to avoid coverage check

* check graph_opt_config exists

* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test

* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
zhupengyang 45ebb2efb4 [XPU] support plugin model (#6092) 2026-01-20 13:00:09 +08:00
jackyYang6 988e0bc338 [Feature] Add PaddleFormers fallback backend (#5999)
* feat(paddleformers): add dense text model fallback backend

* docs(paddleformers): add user guide and fix code review issues

* add fallback unit test

* precommit format

* fix pre-commit

* fix: address code review feedback

* docs: add PaddleFormers backend documentation (EN) and simplify installation

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-19 21:50:50 +08:00
GoldPancake 05fbd89a8e [Speculative Decoding][Bugfix] Fix MTP logprob issues caused by max_num_logprobs (#6084) 2026-01-19 14:55:36 +08:00
ddchenhao66 3685474799 [XPU] xpu support mm prefill batch (#6072)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-19 14:36:35 +08:00
GoldPancake b917b56aca [Bugfix] Fix logprob issues caused by max_num_logprobs (#6067) 2026-01-16 04:40:18 -08:00
周周周 97f96e34ca only update self.exist_prefill_task_signal in v0 (#6064)
* commit

* commit

* commit

---------

Co-authored-by: xiaoluomi <1037819816@qq.com>
2026-01-16 20:11:55 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
guozhuangzhuang d2f1ec2b1b [XPU] fix(xpu_model_runner): reset seq_lens_encoder to 0 for decode role in PD splitwise mode (#6048)
* fix(xpu_model_runner): reset seq_lens_encoder to 0 for decode role in PD splitwise mode

- Set seq_lens_encoder to 0 when splitwise_role is 'decode' during prefill processing
- This ensures proper continuation of decoding after P generates the first token in PD disaggregated architecture
- Fixes potential sequence length inconsistency in PD splitwise deployment scenarios

* format
2026-01-15 20:24:56 +08:00
freeliuzc 49617d9832 [Feature]Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize note and some feature

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
cmcamdy 59d8ae0a25 [XPU] Speculate Decoding + PD, benchmark fix (#6036)
* fix mtp pd

* fix kernel

* fix code style

* fix kernel

* fix test / clear debug code

* fix test / clear debug code

* fix codestyle

* fix codestyle

* fix codestyle
2026-01-15 19:19:03 +08:00
Cheng Yanfei fbcccaa750 [Intel HPU] enable MoE EP for hpu (#5855)
* enable HPU MoE EP

* MoE intermediate_scale stack

* enable loader_v1 esp for tensor_wise_fp8 TP or EP

* modify activation_scale name
2026-01-15 13:08:00 +08:00
ming1753 7c56041272 [BugFix] fix PaddleOCR-VL illegal memory (#6042) 2026-01-14 20:07:43 -08:00
RAM b3f59fd9b5 [RL][CI] Support Async R3 And Add Accuracy Test (#5937)
* add bs1 r3 test case

* async put

* r3 test case 1.0

* success run eb5

* refine test case

* pre-commit

* add eb45 & glm testcase

* format code

* add p2pstore requirements

* support only last turn

* R3 use worker log

* refine code & fix ci bug

* refine error message

* fix empty input bug

* Success set acc ci of eb45 and glm45

* refine code

* fix bug
2026-01-14 04:25:06 -08:00
luukunn 93b7675a64 [Feature]Report FD statistical information (#5646)
* add usage commit

* update envs and xpu

* add requirements

* fix quantization value

* add unit test

* add unit test

* fix unit test

* add unit test

* add unit test

* add unit test

* add unit test

* add unit test

* add unit test

* fix FD_USAGE_STATS_SERVER

* fix

* fix

* add doc

* add doc

* add doc

* add doc

* add doc

* fix file name
2026-01-14 17:54:01 +08:00
MingkunZhang 273e79aa5b [Metax][Fix] fix self.share_inputs['preempted_idx']=[] incorrect use (#6038)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-14 17:06:00 +08:00
chenjian 74d0f1c01f [Optim] Robust sync status when preemption happens (#5796)
* [Bug fix] Sync status for caching output cache

* fix

* fix

* fix bug

* fix

* fix

* support xpu

* fix

* fix

* fix

* fix

* fix

* fix ci

* fix ci

* fix xpu

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
Yonghua Li 456637002d [BugFix] fix cache transfer manager updating/clearing (#5930)
* [fix] fix cache transfer manager updating/clearing

* [fix] fix code style

* [fix] fix config

* [fix] fix engine client

* [fix] let worker update kv cache status signal

* [fix] update worker process

* [fix] fix clear/update for case if comm group is shutdown

* [fix] update dynamic weight manager

* [fix] fix port

* [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting
2026-01-13 05:09:29 -08:00
GoldPancake eb8ce36ae9 [BugFix] Fix entropy calculation issue in TP (#5997)
* fix entropy bugs
2026-01-13 11:10:46 +08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all\transpose fused operator during the decoding phase. (#5917) 2026-01-12 14:09:39 +08:00
zhupengyang 9db48ecb34 [XPU] fix dp4 (#5946) 2026-01-09 20:36:53 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
GoldPancake 3ca99ab170 [Speculative Decoding] Return accepted tokens per head in response (#5947)
* adjust log level

* add accepted tokens per head

* fix ut

* fix
2026-01-09 14:32:08 +08:00
GoldPancake e41d434548 [Bugfix] Fix entropy calculation bugs (#5941)
* fix entropy bugs
2026-01-08 20:57:35 +08:00
Copilot 6825903559 [BugFix] Fix misleading logging in worker_process for request counting (#5939)
* Initial plan

* Optimize logging in worker_process to accurately reflect request types

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Address feedback: rename to max_occupied_batch_index and simplify logging

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Improve comment clarity for batch request counting

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Fix code style: reorder imports with isort

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-08 16:36:22 +08:00
fmiao2372 1ee285c2d6 [Intel HPU] enable chunked prefill (#5903)
* [Intel HPU] enable chunked prefill

* fix bug by copilot comments
2026-01-06 21:01:50 +08:00
ddchenhao66 733014bf32 [XPU] Support EP4TP1 in pd disaggregation (#5860)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-06 15:25:36 +08:00
gaoziyuan e99ec4c9d5 [Bugfix]fix model weight signal tensor num (#5900) 2026-01-06 14:36:59 +08:00
freeliuzc ca574119e5 support multi-step draft-model with cudagraph (#5886) 2026-01-06 11:16:21 +08:00
cmcamdy 690d4bcdb0 [XPU] Speculative Decoding with PD (#5856)
* [XPU] Speculative Decoding with PD

* fix post process

* share kv cache sender

* support speculate decoding step system cache

* support speculate decoding step system cache

---------

Co-authored-by: root <root@gajl-bbc-onlinec-com-1512108.gajl.baidu.com>
2026-01-05 17:31:03 +08:00
Yonghua Li 5e4e6692a4 [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 (#5840)
* [BugFix] fix cache manager not launched in case of mtp or blockwise fp8

* [fix] fix mtp cache in mtp.py

* [fix] fix gpu ops import

* [fix] fix mtp layer idx

* [fix] fix xpu model runner mtp cache

* [fix] fix mtp import
2026-01-04 04:35:37 -08:00