周周周
cbdb2462ea
cp 1131 tbo to develop ( #6281 )
2026-02-03 15:23:23 +08:00
周周周
8277b95fa6
remove speculate_get_padding_offset op ( #6308 )
2026-02-03 15:18:12 +08:00
Moonchild1227
39dc4b0c2e
[Feature] [KVCache] support file_store kv cache backend ( #6188 )
...
* fix(examples): comment out stop.sh to avoid error when script is missing
* feat: add file_store support for cache manager
* [fix] fix multi gpu transfer
* [fix] fix global kvcache transfer
* [Feature] [KVCache] support file_store kv cache backend
* chore: update FileStore according to PR comments
* fix: remove comments
* fix: add swap_cache_layout for file store
* fix: remove rank key
* fix: Switch KV cache storage to pure file mode
* Temporarily disable support for Tensor types
* fix: remove args --kvcache_file_path & add envs FILE_BACKEND_STORAGE_DIR
* fix: Simplify cache_transfer_manager.py
* fix: fix syntax bug
* fix: Simplify file_store.py
* fix: Use the key directly as the filename
* fix: Simplify set()
* fix: Simplify cache_transfer_manager.py & file_store.py
* fix: Only support load to cpu buffer
* feat: add FileStore backend for cache transfer
* fix: guard zmq import
2026-02-03 14:37:58 +08:00
zccjjj
ee77ff9ebe
[config] fix assert message ( #6310 )
2026-02-03 14:37:46 +08:00
Jingfeng Wu
4760835789
Fix heartbeat signal's sleeptime error ( #6241 )
2026-02-03 14:28:51 +08:00
fxyfxy777
f3413c4caa
[BugFix] fix fused_mask_swiglu_fp8_quant bug ( #6316 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (#6222 )"
This reverts commit 2ada119a38 .
* add block_size
* pre-commit
2026-02-03 13:54:12 +08:00
ApplEOFDiscord
6563b8307c
[Bug Fix] fix tokenizer oom ( #6287 )
...
* fix tokenizer oom
* fix unit test
2026-02-03 11:27:11 +08:00
GoldPancake
fb374238e1
Revert "[RL] Support GLM MTP RL Model ( #6223 )" ( #6301 )
...
This reverts commit af6c84d48d .
2026-02-02 14:08:13 +08:00
fxyfxy777
2ada119a38
[Optimize] optimize mask_quant & swiglu ( #6222 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
chenjian
af1b1d2d56
[Feature] Support report token index by attention store ( #6285 )
...
* [Feature] Support report token index by attention store
* fix format
2026-02-02 10:41:11 +08:00
xiaozude
030647521a
[Metax] adapt to the latest develop ( #6282 )
2026-01-29 23:21:20 -08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 ( #5593 )" ( #6275 )
...
This reverts commit eb80724b71 .
2026-01-30 11:22:01 +08:00
chenjian
292bab7e6d
[BugFix] Fix bug for enable output caching ( #6226 )
...
* [BugFix] Fix bug for enable output caching
* fix
* Fix
* fix
* fix ci
2026-01-30 10:55:36 +08:00
mouxin
506f1545cd
[Feature] Enhance Router with /v1/completions, docs, scripts, and version info ( #5966 )
...
* [Doc] Update prerequisites in the documentation
* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info
* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info
---------
Co-authored-by: mouxin <mouxin@baidu.com >
2026-01-30 10:28:48 +08:00
MingkunZhang
c4abb01f9c
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) ( #6266 )
2026-01-29 19:24:36 +08:00
Ryan
5e78c1ac87
[Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph splitting mode ( #6196 )
...
* refine comment && refine variable name
* replace comment
2026-01-29 16:29:54 +08:00
yuxuan
44b52701f6
[Feature] Support NVFP4 MoE on SM100 ( #6003 )
...
* fp4 dense
* [WIP] support nvfp4, dense part
* [wip] developing loading qwen model
* loading
* update
* dense fp4 OK, cudagraph error
* [WIP] moe forward part
* with flashinfer-backend
* qwen3_moe_fp4
* update
* support flashinfer-cutlass moe, qwen3-moe-fp4 OK
* support ernie4.5-fp4
* fix load error
* add some ut
* add docs
* fix CLA, test
* fix the apply() in ModelOptNvFp4FusedMoE
* fix CodeStyle
* del the PADDLE_COMPATIBLE_API
* fix broken url: nvidia_gpu.md
* fix docs
* fix token_ids
* fix CI in Hopper
* move flashinfer imports inside the function
* fix model_runner
Removed the logic for generating random padding IDs.
* Remove skip condition for CUDA version in nvfp4 test
* add test for nvfp4
* fix according to review
* Add Chinese translation link to NVFP4 documentation
* del flashinfer.py
* fix unittest
---------
Co-authored-by: zoooo0820 <zoooo0820@qq.com >
Co-authored-by: bukejiyu <395822456@qq.com >
2026-01-29 14:16:07 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 ( #5593 )
...
* temporarily usable Deepgemm version
* dense part e8m0 ok
* version where the EB model runs end to end with E8M0
* code check
* support 21b-tp2, dev_paddle
* version with single-node 4.5T ep OK
* restore deleted code, single-node 4.5T ep (non-cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* use deepep ue8m0 to fix the accuracy issue
* fix FP8 TP accuracy
* upgrade Deepgemm and adapt the Hopper logic
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb output is normal
* eb5 text is right
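* accuracy looks consistent on visual inspection
* accuracy aligned in self-test
* replace masked_per_token_quant, ep accuracy OK
* performance improved by about 30%
* ep runs for now but still has issues
* self-test results consistent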
* rm test fun
* fix ep event
* update graph optimization ops for Deepgemm
* fix build
* temporarily work around the deepgemm CI build issue
* select deepgemm version based on SM
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com >
2026-01-29 13:49:54 +08:00
GoldPancake
af6c84d48d
[RL] Support GLM MTP RL Model ( #6223 )
...
* support glm mtp rl model
* fix
* fix
* fix ut
* update baseline
2026-01-28 08:28:03 -08:00
ddchenhao66
6d33d5e370
[Models][BugFix] shared experts and dense mlp layer do not require TP split ( #6180 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-28 18:58:19 +08:00
chenjian
6e9a57b7c1
[Bug fix] Fix multi modal fetch feature ( #6095 )
2026-01-28 18:02:26 +08:00
GoldPancake
7d6c87c29e
[Others] Support constrained decoding when enable_thinking is false ( #6248 )
...
* support constrained decoding when enable_thinking is false
* fix
* fix
* fix
2026-01-28 00:05:17 -08:00
sunxin
27f8799f04
[Model Runner] Refactor execute_model for GPU async scheduling ( #6176 )
2026-01-28 14:19:33 +08:00
freeliuzc
ce06c6dfb3
[BugFix] Fix token_penalty kernel ( #6069 )
...
* fix token_penalty kernel
* try to fix xpu
* fix xpu
* fix unit test
2026-01-28 12:03:05 +08:00
Yuanle Liu
8b05774fad
[Others] enhance deep_ep import and support mixed mode flash_mask_attn ( #6238 )
...
* support flashmaskattn mixed and enhance deepep import
* update
* fix
2026-01-28 00:02:02 +08:00
qwes5s5
38378415c7
add token ratio metrics ( #6236 )
2026-01-27 17:00:49 +08:00
周周周
aa57864c5b
remove unneeded para from flash_mask_attention ( #6218 )
2026-01-27 14:04:27 +08:00
jc
b1698a79cb
[RL] add version to the key of cache storage && refine raising error ( #6160 )
...
* Waiting for cache transfer manager inited
* up
* up
* up
* up
* up
* fix according to comments
* fix unittest
* fix
* fix unittest
* fix error
* pass storage_backend to worker
2026-01-27 10:47:46 +08:00
xiaoxiaohehe001
7ffa88bb01
[BugFix] fix mask_attn ( #6214 )
...
* [BugFix] fix mask attn
* [BugFix] fix mask attn
2026-01-26 07:46:51 -08:00
CSWYF3634076
08c411518f
[Loader] support dummy load weight ( #6169 )
...
* [Loader] support dummy load weight
* [Loader] support dummy load weight v2
* [Loader] support dummy load weight unittest
* [Loader] support dummy load weight unittest v2
* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin
adc69c15d0
[Model Runner] Prepare token count and move FA3 initialization into the graph ( #6170 )
...
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周
0966df78dc
[Others] remove stop_nums ( #6182 )
2026-01-26 12:12:47 +08:00
wangyifei
84a1780814
[build] support build sm 80,86,89,90 to one whl package ( #6173 )
...
* support build sm 80,86,89,90 to one whl package
* create tmp dir before build custom ops in FD_UNIFY_BUILD mode
* typo fix
* ignore exceptions in xpu ..
2026-01-26 11:30:02 +08:00
Yuanle Liu
253c5cc16c
Improve deep_ep import handling with logging ( #6207 )
...
* Improve deep_ep import handling with logging
Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging.
* Add traceback import to ep.py
2026-01-24 22:41:42 -08:00
Yonghua Li
833d00e2d7
[BugFix] move cache creation back to cache transfer process and adapt clear/update ( #6144 )
...
* [fix] move cache creation back to cache transfer process
* [fix] fix clear cache
* [chore] change some log level
* [fix] fix clear cache
* [fix] fix clear cache for blockwisefp8 and mtp
* [fix] fix c8
* [fix] fix clear_mtp_cache args
* [chore] update cache_transfer_manager
* [fix] fix update mtp cache
2026-01-24 21:59:13 +08:00
fxyfxy777
79f42209bf
add scale_wrapper for per_block_cast_to_fp8 ( #6183 )
2026-01-23 00:37:20 -08:00
sunxin
bef6293552
[Model Runner] Add exist_prefill_flag ( #6172 )
2026-01-23 13:07:05 +08:00
luukunn
0a19e1b6df
fix image gen ( #6175 )
2026-01-23 11:24:12 +08:00
luukunn
8635d8880d
bug fix tool_calls ( #6166 )
2026-01-23 10:49:27 +08:00
GoldPancake
646aced1eb
[UT] Add GLM E2E tests for non-MTP and MTP ( #6163 )
...
* add glm ut
2026-01-23 10:34:29 +08:00
wangyifei
b7c5daa316
[RL] add pause, update_weights, resume interface for async RL ( #6052 )
...
* support dynamic run_control_request through zmq from apiserver to common_engine
* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method
* change /is_paused from HTTP POST method to GET method
* add pause, resume, is_paused implementation
* support engine <==> worker communication(request&response)
* support sync weights through RDMA from checkpoint_transfer
* support specified version, rsync_config in update_weights rpc call
* add pause, update_weights, resume interface for async RL
* bug fix: update_weights support using default arguments
* fix typo
* typo fix
* typo fix
* typo fix
* add unittest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all
* add "rsync" to LoadConfig.load_strategy Literal type hints
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* typo fix
* typo fix
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* check version/rsync params
* add error log when version.txt not exists
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* raise specified ValueError when parameter check fails
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* tp barrier after run_control_method
* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue
* typo fix
* typo fix
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-01-23 10:18:07 +08:00
Ryan
31c219d483
[Graph Optimization] Add max_capture_shape_prefill && cudagraph_capture_sizes_prefill ( #6148 )
...
* Add max_capture_shape_dy2st parameter to YAML config
* split cudagraph capture size between decode and prefill
* rm if
* add default value
2026-01-22 21:37:18 +08:00
Yonghua Li
8d27a523e7
[Feature] [KVCache] support attention_store kv cache backend ( #5823 )
...
* [feat] support attention_store kv cache backend
* [fix] fix codestyle
* [chore] optimize log
* [fix] fix write storage task
* [fix] fix read storage
* [fix] fix code conflict after merge develop
* [fix] fix cache bytes and read task token ids
* [chore] add model for cache transfer manager
* [chore] add some log
* [chore] remove launched_cache_manager_signal
* [fix] fix write_back_storage_task match_block_num condition
* [fix] fix swap_cost_time
* [ci] fix ci
* Update fastdeploy/engine/sched/resource_manager_v1.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/cache_manager/cache_transfer_manager.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/cache_manager/transfer_factory/mooncake_store/attention_store.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-01-22 21:01:23 +08:00
yinwei
3cd0ffe36c
Enable CudaGraph
2026-01-22 19:49:33 +08:00
Yonghua Li
bb76d3b6f0
[RL] [APIServer] add more status codes for update/clear api ( #6141 )
...
* [RL] add more status codes for update/clear api
* [feat] return json response
* [fix] fix ci
2026-01-22 17:26:18 +08:00
luukunn
6b968a76f1
[Optimization] update data_processor & add tool parser plugins ( #6096 )
...
* update data_processor
* fix unit test
* fix unit test
* add unit test
* add tool parser plugins
* fix tool call
* fix tool call
* fix tool call
* fix unit test
* fix unit test
* add unit test
* fix unit test
* fix unit test
* fix unit test
2026-01-22 17:17:32 +08:00
yinwei
1e3c35496c
[XPU][Graph Optimization] XPU Support CUDAGraph ( #6152 )
...
* support cuda graph
2026-01-22 14:41:56 +08:00
Haonan Luo
82057cb71f
Support MXFP4 for GPT-OSS ( #5435 )
...
* support mxfp4 in gpt-oss
* support mxfp4 in gpt-oss
* add scope for flashinfer
* remove torch code
* update envs.FD_MXFP4_BACKEND
* update process_weights_after_loading
* update env name
* support tp in gpt-oss, add e2e test
* add flashinfer-python-paddle in requirements
* fix import error
* add test
* add test
* add test
* add test
2026-01-22 14:21:01 +08:00
jc
309c7d9764
router support divided rollout ( #6150 )
2026-01-22 10:39:39 +08:00
fxyfxy777
9c4db0ac3f
[BugFix] fix weight quant op ( #6137 )
...
* fix weight quant
* fix weight quant
* bit equal
* code style
2026-01-22 09:50:57 +08:00