* [BugFix] Force top_k=1 for greedy decoding when temperature=0
When temperature is set to 0 (greedy decoding), setting temperature to a
small epsilon alone is insufficient: the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.
Additionally, add an argmax fast path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore the top_k parameter.
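A minimal sketch of the greedy override described above, written in plain Python rather than as the actual sampling kernel. The function name `sample_token` and its signature are hypothetical; only `temperature`, `top_k`, and the argmax behavior come from the description.

```python
import math
import random


def sample_token(logits, temperature, top_k=0, rng=None):
    """Hypothetical sampler: returns the index of the chosen token."""
    if temperature == 0:
        # Greedy request: force top_k=1 instead of relying on a tiny epsilon.
        top_k = 1
    if top_k == 1:
        # Argmax fast path: no random draw; ties go to the lowest index.
        return max(range(len(logits)), key=lambda i: (logits[i], -i))
    rng = rng or random.Random()
    # Keep the top_k highest logits (all of them when top_k <= 0).
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    kept = order[:top_k] if top_k > 0 else order
    # Softmax over the kept logits at the given temperature.
    peak = max(logits[i] for i in kept)
    weights = [math.exp((logits[i] - peak) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With this shape, a request with temperature=0 never reaches the random draw, so the result does not depend on whether a backend honors top_k.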
* Extract greedy decoding from FD_DETERMINISTIC_MODE guard
Treating top_k=1 as argmax is a correctness optimization that is not
specific to deterministic mode. Remove the FD_DETERMINISTIC_MODE guard so
that the all-greedy fast path and the mixed-batch override apply
unconditionally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update test_torch_model.py
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
* [Feature] Add Triton unified attention kernel for deterministic inference
Add a Triton-based unified extend attention kernel that processes both
prefix (cached) and extend (new) KV tokens through a single kernel with
unified kv_indices, ensuring identical accumulation order regardless of
cache hit/miss patterns.
Key components:
- _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV
cache support, and causal masking for prefix+extend
- Index building utilities: triton_cumsum_with_zero_prefix,
build_kv_indices_from_block_tables, build_unified_kv_indices,
_scatter_extend_kv_indices_kernel (all CUDA Graph compatible)
- pre_cache_len_concat_triton: GPU-only replacement for C++ op
- Reference implementations (_ref variants) for correctness validation
- Comprehensive tests: kernel correctness, split invariance,
determinism, production-scale, cross-validation
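The idea behind the unified kv_indices can be illustrated with a pure-Python sketch. It is not the Triton kernel: `attend_unified` and `concat_kv_indices` are hypothetical names, it handles a single query without causal masking or paging, and the tile size `block` is an arbitrary choice. It shows online softmax over one index list, so the accumulation order depends only on that list and not on where the prefix/extend boundary falls.

```python
import math


def concat_kv_indices(prefix_slots, extend_slots):
    """Hypothetical helper: one index list covering cached and new tokens."""
    return list(prefix_slots) + list(extend_slots)


def attend_unified(q, k_cache, v_cache, kv_indices, block=2):
    """Online-softmax attention for one query over the slots in kv_indices."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(v_cache[0])
    scale = 1.0 / math.sqrt(len(q))
    for start in range(0, len(kv_indices), block):
        tile = kv_indices[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k_cache[i])) for i in tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [x * correction for x in acc]
        for s, i in zip(scores, tile):
            p = math.exp(s - new_max)
            denom += p
            acc = [x + p * v for x, v in zip(acc, v_cache[i])]
        running_max = new_max
    return [x / denom for x in acc]
```

Because prefix and extend tokens go through the same loop, moving the boundary between them (a cache hit versus a miss) leaves the sequence of floating-point operations unchanged.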
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Vectorize causal mask in test references for ~26x speedup
Replace the triple-nested Python for-loop with a paddle.where vectorized
mask in naive_attention and _build_causal_mask. seq4096 test: 2m39s -> 6s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix coverage
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path
- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
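A rough illustration of what batch invariance means for RMSNorm, assuming "M" refers to the number of rows in the input. This is a pure-Python sketch, not the Triton kernel; `rms_norm_rows` and the `eps` default are hypothetical.

```python
import math


def rms_norm_rows(x, weight, eps=1e-6):
    """RMSNorm with each row reduced on its own, in a fixed left-to-right order."""
    out = []
    for row in x:
        sq_sum = 0.0
        for value in row:  # reduction order never depends on len(x)
            sq_sum += value * value
        inv_rms = 1.0 / math.sqrt(sq_sum / len(row) + eps)
        out.append([value * inv_rms * w for value, w in zip(row, weight)])
    return out
```

Since the reduction for a row never looks at the other rows, normalizing a row alone or inside a larger batch gives bitwise-identical output, which is the property the invariance tests check.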
* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size
- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py
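For context on the 0.0078125 figure: it equals 2^-7, the gap between adjacent bfloat16 values in [1, 2), so a single rounding step at that magnitude already exceeds an atol of 1e-3 while staying under 1e-2. The sketch below checks this with the standard library only; the helper names are hypothetical, and it assumes positive finite inputs.

```python
import struct


def to_bfloat16(value):
    """Round a float to the nearest bfloat16 (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest even on the 16 low bits that bfloat16 drops.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bfloat16_spacing(value):
    """Gap to the next larger bfloat16 value (positive finite inputs only)."""
    base = to_bfloat16(value)
    bits = struct.unpack("<I", struct.pack("<f", base))[0]
    upper = struct.unpack("<f", struct.pack("<I", bits + 0x10000))[0]
    return upper - base
```

For example, `bfloat16_spacing(1.0)` evaluates to 0.0078125, and the spacing doubles to 0.015625 at 2.0.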
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add RMSNorm layer batch_invariant_mode unit test for coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add pragma no cover for Triton kernel and multi-GPU embedding path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed on device
* format code
* add unittest & fix
* fix
* format-code
* fix rebase
* .
* fix unittest
* cherry-pick Support Fully Async and PrefixCache step 1
* copy routing_indices_cache.py from 2.4
* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)
* cherry-pick [RL] Clear Requests status of R3 (#6569)
* delete code
* fix rename bug
* fix status shape bug
* fix ci
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute ops with pure CUDA
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>