周周周
cbdb2462ea
cp 1131 tbo to develop ( #6281 )
2026-02-03 15:23:23 +08:00
周周周
8277b95fa6
remove speculate_get_padding_offset op ( #6308 )
2026-02-03 15:18:12 +08:00
Moonchild1227
39dc4b0c2e
[Feature] [KVCache] support file_store kv cache backend ( #6188 )
...
* fix(examples): comment out stop.sh to avoid error when script is missing
* feat: add file_store support for cache manager
* [fix] fix multi gpu transfer
* [fix] fix global kvcache transfer
* [Feature] [KVCache] support file_store kv cache backend
* chore: update FileStore according to PR comments
* fix: remove comments
* fix: add swap_cache_layout for file store
* fix: remove rank key
* fix: Switch KV cache storage to pure file mode
* Temporarily disable support for Tensor types
* fix: remove args --kvcache_file_path & add envs FILE_BACKEND_STORAGE_DIR
* fix: Simplify cache_transfer_manager.py
* fix: fix syntax bug
* fix: Simplify file_store.py
* fix: Use the key directly as the filename
* fix: Simplify set()
* fix: Simplify cache_transfer_manager.py & file_store.py
* fix: Only support load to cpu buffer
* feat: add FileStore backend for cache transfer
* fix: guard zmq import
2026-02-03 14:37:58 +08:00
zccjjj
ee77ff9ebe
[config] fix assert message ( #6310 )
2026-02-03 14:37:46 +08:00
Jingfeng Wu
4760835789
Fix heartbeat signal's sleeptime error ( #6241 )
2026-02-03 14:28:51 +08:00
xjkmfa
e27a7cc5b0
[Benchmark] CE qwen3 vl ( #6288 )
...
* [CE]qwen3-vl
2026-02-03 14:17:28 +08:00
fxyfxy777
f3413c4caa
[BugFix] fix fused_mask_swiglu_fp8_quant bug ( #6316 )
...
* optimize mask_quant op, ~1.5x speedup
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (#6222 )"
This reverts commit 2ada119a38 .
* add block_size
* pre-commit
2026-02-03 13:54:12 +08:00
ApplEOFDiscord
6563b8307c
[Bug Fix] fix tokenizer oom ( #6287 )
...
* fix tokenizer oom
* fix unit test
2026-02-03 11:27:11 +08:00
GoldPancake
fb374238e1
Revert "[RL] Support GLM MTP RL Model ( #6223 )" ( #6301 )
...
This reverts commit af6c84d48d .
2026-02-02 14:08:13 +08:00
fxyfxy777
2ada119a38
[Optimize] optimize mask_quant & swiglu ( #6222 )
...
* optimize mask_quant op, ~1.5x speedup
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
xunyoyo
25656455ee
[CI] [Hackathon 10th Spring No.38] Add unit tests for module fastdeploy/entrypoints/openai/serving_completion.py ( #6227 )
...
* Add serving completion tests
* test: tighten serving completion coverage
2026-02-02 12:53:04 +08:00
chenjian
af1b1d2d56
[Feature] Support report token index by attention store ( #6285 )
...
* [Feature] Support report token index by attention store
* fix format
2026-02-02 10:41:11 +08:00
kesmeey
afee0b9c5e
[CI] [Hackathon 10th Spring No.30] Add unit tests for module fastdeploy/inter_communicator/engine_worker_queue.py ( #6102 )
...
* test: add comprehensive tests for EngineWorkerQueue to improve code coverage
* style: format tests/inter_communicator/test_e2w_queue.py with black
---------
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com >
2026-01-30 21:37:29 +08:00
xiaozude
030647521a
[Metax] adapt to the latest develop ( #6282 )
2026-01-29 23:21:20 -08:00
xunyoyo
18ebce9dec
[CI] [Hackathon 10th Spring No.41] Add unit tests for module fastdeploy/entrypoints/llm.py ( #6108 )
...
* Add LLM entrypoint tests for coverage
* test: streamline llm entrypoint coverage
* test: format llm tests
2026-01-30 12:58:10 +08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 ( #5593 )" ( #6275 )
...
This reverts commit eb80724b71 .
2026-01-30 11:22:01 +08:00
chenjian
292bab7e6d
[BugFix] Fix bug for enable output caching ( #6226 )
...
* [BugFix] Fix bug for enable output caching
* fix
* Fix
* fix
* fix ci
2026-01-30 10:55:36 +08:00
mouxin
506f1545cd
[Feature] Enhance Router with /v1/completions, docs, scripts, and version info ( #5966 )
...
* [Doc] Update prerequisites in the documentation
* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info
* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info
---------
Co-authored-by: mouxin <mouxin@baidu.com >
2026-01-30 10:28:48 +08:00
MingkunZhang
c4abb01f9c
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error (based on PaddlePaddle#6069) ( #6266 )
2026-01-29 19:24:36 +08:00
Zhang Yulong
f3c12be4d2
Update _build_linux_rl.yml ( #6274 )
2026-01-29 19:10:47 +08:00
YuBaoku
bb7c1d13e1
[CI] Remove --ipc=host and --pid=host from _stable_test.yml ( #6270 )
2026-01-29 17:06:06 +08:00
Ryan
5e78c1ac87
[Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph splitting mode ( #6196 )
...
* refine comment && refine variable name
* replace comment
2026-01-29 16:29:54 +08:00
周周周
e237313797
[BugFix] allow return code 250 in tests/distributed/test_fusedmoe_ep_entry.py ( #6269 )
2026-01-29 16:00:03 +08:00
yuxuan
44b52701f6
[Feature] Support NVFP4 MoE on SM100 ( #6003 )
...
* fp4 dense
* [WIP] support nvfp4, dense part
* [wip] developing loading qwen model
* loading
* update
* dense fp4 OK, cudagraph error
* [WIP] moe forward part
* with flashinfer-backend
* qwen3_moe_fp4
* update
* support flashinfer-cutlass moe, qwen3-moe-fp4 OK
* support ernie4.5-fp4
* fix load error
* add some ut
* add docs
* fix CLA, test
* fix the apply() in ModelOptNvFp4FusedMoE
* fix CodeStyle
* del the PADDLE_COMPATIBLE_API
* fix broken url: nvidia_gpu.md
* fix docs
* fix token_ids
* fix CI in Hopper
* move flashinfer imports inside the function
* fix model_runner
Removed the logic for generating random padding IDs.
* Remove skip condition for CUDA version in nvfp4 test
* add test for nvfp4
* fix according to review
* Add Chinese translation link to NVFP4 documentation
* del flashinfer.py
* fix unittest
---------
Co-authored-by: zoooo0820 <zoooo0820@qq.com >
Co-authored-by: bukejiyu <395822456@qq.com >
2026-01-29 14:16:07 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 ( #5593 )
...
* Deepgemm temporarily usable version
* dense part e8m0 OK
* version where the EB model runs end-to-end with E8M0
* code check
* support 21b-tp2, dev_paddle
* single-node 4.5T ep OK version
* restore deleted code; single-node 4.5T ep (non-cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* use deepep ue8m0 to resolve accuracy issue
* fix FP8 TP accuracy
* upgrade Deepgemm with Hopper adaptation logic
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb output is normal
* eb5 text is right
* accuracy looks consistent on inspection
* self-tested accuracy aligned
* replace masked_per_token_quant; ep accuracy OK
* performance improved by ~30%
* ep runs for now but still has issues
* self-test consistent
* rm test fun
* fix ep event
* update Deepgemm in graph-optimization ops
* fix build
* temporarily work around deepgemm CI build issue
* select deepgemm version by SM
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com >
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com >
2026-01-29 13:49:54 +08:00
GoldPancake
af6c84d48d
[RL] Support GLM MTP RL Model ( #6223 )
...
* support glm mtp rl model
* fix
* fix
* fix ut
* update baseline
2026-01-28 08:28:03 -08:00
YuBaoku
b07b76e03f
[CI] Fix nightly cu129 build_outputs upload failure ( #6264 )
2026-01-28 23:39:39 +08:00
jc
7da5f54fb3
[CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler ( #6250 )
...
* Add unit test for swap_layout
* remove splitwise_scheduler test
2026-01-28 19:20:20 +08:00
ddchenhao66
6d33d5e370
[Models][BugFix] shared experts and dense mlp layer do not require TP split ( #6180 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-28 18:58:19 +08:00
chenjian
6e9a57b7c1
[Bug fix] Fix multi modal fetch feature ( #6095 )
2026-01-28 18:02:26 +08:00
GoldPancake
7d6c87c29e
[Others] Support constrained decoding when enable_thinking is false ( #6248 )
...
* support constrained decoding when enable_thinking is false
* fix
* fix
* fix
2026-01-28 00:05:17 -08:00
sunxin
27f8799f04
[Model Runner] Refactor execute_model for GPU async scheduling ( #6176 )
2026-01-28 14:19:33 +08:00
freeliuzc
ce06c6dfb3
[BugFix] Fix token_penalty kernel ( #6069 )
...
* fix token_penalty kernel
* try to fix xpu
* fix xpu
* fix unit test
2026-01-28 12:03:05 +08:00
YuBaoku
85db063da6
[CI] Fix workflow validation error in publish_job
2026-01-28 10:44:30 +08:00
Yuanle Liu
8b05774fad
[Others] enhance deep_ep import and support mixed mode flash_mask_attn ( #6238 )
...
* support flashmaskattn mixed and enhance deepep import
* update
* fix
2026-01-28 00:02:02 +08:00
YuBaoku
029cceec33
[CI] Switch nightly build to use FD_UNIFY_BUILD ( #6246 )
...
* [CI] Adapt build script for unified and arch-specific builds
* [CI] Switch nightly build to use FD_UNIFY_BUILD
2026-01-27 23:53:42 +08:00
YuBaoku
d975f6acdd
[CI] adjust resource scheduling of _stable_test ( #6235 )
2026-01-27 22:31:13 +08:00
Divano
ba9d2a9e5a
[CI] add update weights tests ( #6242 )
2026-01-27 20:54:21 +08:00
qwes5s5
38378415c7
add token ratio metrics ( #6236 )
2026-01-27 17:00:49 +08:00
ophilia-lee
1705d0af7a
[benchmark] Support fetching cached tokens from SGLang/vLLM ( #6240 )
...
* benchmark tool supports specifying response_format for constrained-decoding scenarios
* Update backend_request_func.py
make the output.success check tolerate an empty reply when overlong thinking content is truncated
* Update benchmark_serving.py
update benchmark_metrics
* support the Completions API
* support the Completions API
* support the Completions API
* [Benchmark] support the Completions API
* [Benchmark] support the Completions API
* [Benchmark] async_request_eb_openai_completions: raise the aiohttp default read buffer size to 4M to fix the "Chunk too big" error when streamed chunks are too large
* [Benchmark] raise the aiohttp default read buffer size to 10M to fix the "Chunk too big" error when streamed chunks are too large
* [Benchmark] support fetching vLLM/SGLang cached_tokens
* [benchmark] support fetching cached tokens from SGLang/vLLM
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-27 14:57:20 +08:00
周周周
aa57864c5b
remove unneeded para from flash_mask_attention ( #6218 )
2026-01-27 14:04:27 +08:00
Jiaxin Sui
f1cee7fd5e
[XPU] [CI] XPU CI Update ( #6211 )
...
* Update log file path in test_pd_21b_ep4tp1.py
* Update log file path in test_pd_21b_ep4tp4.py
* Update log file path in test_pd_p_tp4ep4_d_tp1ep4
2026-01-27 11:45:53 +08:00
jc
b1698a79cb
[RL] add version to the key of cache storage && refine raising error ( #6160 )
...
* Waiting for cache transfer manager inited
* up
* up
* up
* up
* up
* fix according comments
* fix unittest
* fix
* fix unittest
* fix error
* pass storage_backend to worker
2026-01-27 10:47:46 +08:00
xiaoxiaohehe001
7ffa88bb01
[BugFix] fix mask_attn ( #6214 )
...
* [BugFix] fix mask attn
* [BugFix] fix mask attn
2026-01-26 07:46:51 -08:00
yangjianfengo1
b3627b59f8
[Bug Fix] fix mask attention ( #6216 )
2026-01-26 07:46:26 -08:00
yinwei
56d01f7e49
[XPU][CI]Add Cuda Graph CI Case ( #6229 )
...
* add cuda graph ci case
2026-01-26 23:20:44 +08:00
CSWYF3634076
08c411518f
[Loader] support dummy load weight ( #6169 )
...
* [Loader] support dummy load weight
* [Loader] support dummy load weight v2
* [Loader] support dummy load weight unittest
* [Loader] support dummy load weight unittest v2
* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin
adc69c15d0
[Model Runner] Prepare token count and move FA3 initialization into the graph ( #6170 )
...
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周
0966df78dc
[Others] remove stop_nums ( #6182 )
2026-01-26 12:12:47 +08:00
wangyifei
84a1780814
[build] support build sm 80,86,89,90 to one whl package ( #6173 )
...
* support build sm 80,86,89,90 to one whl package
* create tmp dir before build custom ops in FD_UNIFY_BUILD mode
* typo fix
* ignore exceptions in xpu ..
2026-01-26 11:30:02 +08:00