JYChen
43ace7af25
[RL] support moe-topk use topk_reduce_func ( #7218 )
...
* support moe-topk use topk_reduce_func
* fix ep error
* fix ut
* fix ut
2026-04-09 11:01:03 +08:00
AIbin
48d2bbeb74
fix dsa ( #7252 )
2026-04-08 20:21:38 +08:00
Yuanle Liu
1af7f80811
Revert "[BugFix][Speculative Decoding] Correct index calculation in speculate…" ( #7133 )
...
This reverts commit ba1aa1edff .
2026-04-01 06:54:23 -07:00
lonelygsh
ba1aa1edff
[BugFix][Speculative Decoding] Correct index calculation in speculate decoding operators ( #7121 )
...
- Fix accept_idx calculation in spec_set_value_by_stop_seqs
- Fix condition check from < to <= for token matching
- Fix accept_tokens indexing logic
- Remove unnecessary -1 in current_step comparison for max_think_len
Co-authored-by: guanshihui] <guanshihui@baidu.com >
2026-04-01 05:36:53 -07:00
sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
mpgemm
7a20eaebe8
[Feature] Support cute cpp Encoder FA4 ( #7016 )
...
* add cute cpp fa4
* 删掉注释
* 修正合并错误
* sm_version放到函数内
* ci错误
2026-03-30 10:54:56 +08:00
huicongyao
25d64efdc4
[Speculative Decoding] Refactor Eagle MTP hidden states copy ( #6812 )
...
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states
* readibility
* fix xpu bug
* fix coverage failure
* change luanch params & parallelize position_map compute
* Fix MTP-related bugs in FastDeploy centralized inference
* fix
* refactor mtp hidden_states process
* fix
* add unittest & optimize kernel
* remove useless code
* fix
2026-03-25 22:54:31 -07:00
freeliuzc
7a6c28781b
[Speculative Decoding] Optimize attn_mask_offset and fix mtp bug ( #7005 )
...
* optimize attn_mask_offset and optimize mtp usage
* delete useless branch
* fix kernel format
* fix kernel runner
2026-03-25 01:52:06 -07:00
SUN Dong
6cff780fdb
[RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment ( #6850 )
...
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization
* update
* update
* update
* support custom topk inDeepGemmFusedMoeMethod apply_tp
* apply_ep_prefill support moe_topk_select
* update
* add ut
* add ut
* add ut
* modity doc
* fix env and docs
* add ut
---------
Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com >
2026-03-24 11:12:39 +08:00
freeliuzc
e87ce4b8cd
[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess ( #6973 )
...
* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp sturcture logic. Update spec-branch status-process
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some note
* fix unittest and optmize note
* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
AIbin
bf7e2424d0
[Optimization][Feature]Supports multiple batches of DSK-DSA. ( #6930 )
...
* support DSA_MUTI_BATCH
* update test topk
* update dsk-dsa
2026-03-20 15:59:22 +08:00
RichardWooSJTU
4ed483d20b
[BugFix] Fix ep compatibility issues & Optimize permute operator ( #6821 )
...
* fix ep compatibility issues & optimize permute operator
* fix ut
* fix ut
2026-03-17 10:32:11 +08:00
周周周
091e3c815d
Dsa clean code,add dsk_attn_write_cache baseline ( #6855 )
2026-03-16 11:01:14 +08:00
周周周
820eb60ec6
[Others] clean code ( #6839 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-03-14 11:09:28 +08:00
周周周
8c1a2827d3
DSA clean code ( #6827 )
2026-03-13 16:39:47 +08:00
freeliuzc
12f412448b
[Speculative Decoding] Fix speculate stop_seqs and fix accept_num in eos branch ( #6825 )
2026-03-12 23:48:24 -07:00
huicongyao
2e63d88f7a
[Optimization][Speculative Decoding]Fuse padding sampling params ( #6765 )
...
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unitest
2026-03-12 05:05:15 -07:00
RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens ( #6574 )
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix feild
* support num worst tokens: delete requiements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-11 17:09:07 +08:00
AIbin
1118351b27
[Optimization] Update Deepseekv3.2 model and dsa-indexer networking and add some unitest ( #6762 )
...
* add deepseek model doc
* update deepseek model doc
* update deepseek model doc
* update deepseek model doc
* cwb suppor DSK_V32 Model
* update DSK_V32_DSA modeling
* Ibin Support DSK_DSA
* update kernel
* update yaml
* update requirements
* update pre_commit
* update model-runner
* fix CI bug
* del start.sh
* fix iluvatar_model_runner
* update DSA & add unitest
* update import deep_gemm
2026-03-11 15:52:54 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
huicongyao
0f718baaf2
[Speculative Decoding]Reformat input preprocess for spec decode ( #6501 )
...
* add speculate_pre_process kernel
* reduce one slice
* make d2h async && fix mtp bug for new pre_process
* fix
* add unitest
* fix: code stype formatting
* fix
* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
ming1753
97eee75677
[Feature] GPU Memory Optimization and Retirement of V0 Scheduler ( #6407 )
...
* Optim GPU Mem Usage
---------
Co-authored-by: huzesen <huzesen@baidu.com >
2026-02-28 15:07:43 +08:00
GoldPancake
2178f2829b
[Speculative Decoding] Support suffix decoding ( #6403 )
...
* support suffix decoding
2026-02-26 11:42:05 +08:00
Yuanle Liu
6d3fede240
[OP][Feature] 统一 limit_thinking_content_length CUDA 算子,支持回复长度限制与注入序列 ( #6493 )
...
* Initial plan
* Migrate PRs #6311 , #6129 , #6305 to develop and merge unit tests
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add disable-thinking case to test_chat_with_response_max_tokens
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add both reasoning_max_tokens and response_max_tokens case
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
2026-02-25 21:36:50 +08:00
chen
a8ffcaa068
fix fa4 test ( #6408 )
2026-02-10 10:57:21 +08:00
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
chen
29a313a402
[Optimization] Support FA2/FA3/FA4 with attn_mask_q ( #6354 )
...
* support FA4 sm100
* flash attn backend support mask
* flash attn backend run flashmask correct
* add test for flash_attn_backend and flash_attn_func
* check
* add test for fa4
* requirements.txt add fa4 whl
* check test on sm100
* fix CI conflict
* add enable_torch_proxy for flash_mask
* lazy import fa4
* check
* fix tests import
* check test_load_mpt import
2026-02-05 14:39:00 +08:00
fxyfxy777
36547cfdb3
[Feature] FD_USE_PHI_FP8_QUANT ( #6320 )
...
* add ut
* add use_fd_quant env
* rm mask_per_token_quant
* add make ops list
* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT 默认是true
* modify comments
* use bool type
* Add function declaration
2026-02-03 22:33:03 -08:00
周周周
6225439778
add PADDLE_ENFORCE ( #6321 )
2026-02-04 10:47:19 +08:00
周周周
8277b95fa6
remove speculate_get_padding_offset op ( #6308 )
2026-02-03 15:18:12 +08:00
fxyfxy777
2ada119a38
[Optimize] optimize mask_quant & swiglu ( #6222 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 ( #5593 )" ( #6275 )
...
This reverts commit eb80724b71 .
2026-01-30 11:22:01 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 ( #5593 )
...
* Deepgemm暂时可用版本
* dense部分 e8m0 ok
* EB模型E8M0跑通的版本
* code check
* support 21b-tp2, dev_paddle
* 单机4.5T ep OK的版本
* 修复删除的代码,单机4.5T ep(非cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* 使用deepep ue8m0, 解决精度问题
* 修复FP8 TP精度
* Deepgemm升级适配Hopper逻辑
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb 输出正常
* eb5 text is right
* 目测精度一致
* 自测精度对齐
* 替换masked_per_token_quant, ep精度OK
* 性能提升约30%
* 暂时跑通ep但是有问题
* 自测一致
* rm test fun
* fix ep event
* 图优化算子更新Deepgemm
* fix build
* 暂时绕过deepgemm CI编译问题
* 根据SM区分deepgemm版本
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com >
2026-01-29 13:49:54 +08:00
jc
7da5f54fb3
[CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler ( #6250 )
...
* Add unit test for swap_layout
* remove splitwise_scheduler test
2026-01-28 19:20:20 +08:00
GoldPancake
7d6c87c29e
[Others] Support constrained decoding when enable_thinking is false ( #6248 )
...
* support constrained decoding when enable_thinking is false
* fix
* fix
* fix
2026-01-28 00:05:17 -08:00
sunxin
27f8799f04
[Model Runner] Refactor execute_model for GPU async scheduling ( #6176 )
2026-01-28 14:19:33 +08:00
freeliuzc
ce06c6dfb3
[BugFix] Fix token_penalty kernel ( #6069 )
...
* fix token_penalty kernel
* try to fix xpu
* fix xpu
* fix unit test
2026-01-28 12:03:05 +08:00
周周周
aa57864c5b
remove unneeded para from flash_mask_attention ( #6218 )
2026-01-27 14:04:27 +08:00
周周周
0966df78dc
[Others] remove stop_nums ( #6182 )
2026-01-26 12:12:47 +08:00
lizexu123
f4902fe42d
[BugFix] fix wint2 ( #6109 )
...
* fix
* fix
* fix
2026-01-20 21:46:21 +08:00
fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops ( #5991 )
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdate code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-15 03:59:55 -08:00
freeliuzc
17866c028e
add more cases for attention unit test ( #5931 )
2026-01-15 19:52:35 +08:00
sunxin
2533836dbb
[Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel ( #5880 )
...
* qk rmsnorm fused
* inplace
* glm
* fix
* add qknorm layer
* fix
* update
* fix qwen3 vl
* update rl baseline
* fix qwen3 vl moe
* test
* fix qwen vl moe rl
* fix
2026-01-12 05:10:21 -08:00
freeliuzc
9018ccf74e
[Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes ( #5738 )
...
* fix attn_mask_offset in mtp with multi-step and pd-split-mode
* fix xpu operater register
* update pmtp multi-step mtp strategy in d-split -mode
* add note
* fix xpu register
2025-12-25 01:54:59 -08:00
lizexu123
6d323769dd
fix w4afp8 ( #5634 )
2025-12-22 13:39:41 +08:00
Yuanle Liu
cdc0004894
Revert "[Feature] add ue8m0 for per_token_quant_fp8 ( #5563 )" ( #5611 )
...
This reverts commit 73e1d6aa90 .
2025-12-17 13:59:06 +08:00
Yuanle Liu
867803ae10
[BugFix] fix speculate_limit_thinking_content_length ( #5590 )
...
* fix speculate_limit_thinking_content_length
* update
2025-12-16 04:31:45 -08:00
fxyfxy777
73e1d6aa90
[Feature] add ue8m0 for per_token_quant_fp8 ( #5563 )
...
* ue8m0
* add default arg
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-12-16 18:40:12 +08:00
lizexu123
95eab9f9ee
[Feature] support stop_token_ids ( #5399 )
...
* support stop_token_ids
* fix
* delete chinese
* support both
* delete print
2025-12-09 17:49:12 +08:00