* [Feature] Add Triton unified attention kernel for deterministic inference
Add a Triton-based unified extend-attention kernel that processes both
prefix (cached) and extend (new) KV tokens in a single kernel launch
over a unified kv_indices array, ensuring an identical accumulation
order regardless of cache hit/miss patterns.
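The determinism claim follows from the online-softmax recurrence: if every query visits its KV tokens in one fixed order given by the unified index list, the floating-point accumulation is bit-identical no matter how tokens are split between cache and new input. A minimal single-query NumPy sketch (helper names are illustrative, not the Triton kernel's):

```python
import numpy as np

def online_softmax_attention(q, k, v, kv_indices):
    """Single-query online softmax, visiting KV tokens in kv_indices order."""
    m = -np.inf                                  # running max of scores
    l = 0.0                                      # running softmax denominator
    acc = np.zeros_like(v[0], dtype=np.float64)  # running weighted sum of V
    for idx in kv_indices:
        s = float(q @ k[idx])
        m_new = max(m, s)
        scale = np.exp(m - m_new) if m != -np.inf else 0.0
        p = np.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v[idx]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
k = rng.standard_normal((6, d))
v = rng.standard_normal((6, d))

# Tokens 0..3 play the role of the cached prefix, 4..5 the extend tokens;
# the unified index list fixes one traversal order for all of them.
unified = [0, 1, 2, 3, 4, 5]
out1 = online_softmax_attention(q, k, v, unified)
out2 = online_softmax_attention(q, k, v, unified)
assert np.array_equal(out1, out2)  # bit-identical across runs

# Cross-check against a plain softmax-attention reference.
s = k @ q
p = np.exp(s - s.max())
p /= p.sum()
ref = p @ v
assert np.allclose(out1, ref)
```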
Key components:
- _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV
cache support, and causal masking for prefix+extend
- Index building utilities: triton_cumsum_with_zero_prefix,
build_kv_indices_from_block_tables, build_unified_kv_indices,
_scatter_extend_kv_indices_kernel (all CUDA Graph compatible)
- pre_cache_len_concat_triton: GPU-only replacement for the C++ op
- Reference implementations (_ref variants) for correctness validation
- Comprehensive tests: kernel correctness, split invariance,
determinism, production-scale shapes, and cross-validation
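To make the index-building step concrete, here is a hedged NumPy sketch of what utilities like build_kv_indices_from_block_tables and triton_cumsum_with_zero_prefix compute (function signatures and the page layout are assumptions for illustration, not the actual API): per-request offsets via an exclusive cumsum, paged-cache slots translated from the block table for the prefix, then extend-token indices appended.

```python
import numpy as np

def cumsum_with_zero_prefix(lens):
    # e.g. [3, 2, 4] -> [0, 3, 5, 9]; row i occupies out[i]:out[i+1]
    out = np.zeros(len(lens) + 1, dtype=np.int64)
    np.cumsum(lens, out=out[1:])
    return out

def build_unified_kv_indices(block_tables, prefix_lens,
                             extend_starts, extend_lens, page_size):
    """Flatten per-request prefix (paged cache) + extend (new) token
    indices into one kv_indices array, in a fixed logical order."""
    offsets = cumsum_with_zero_prefix(prefix_lens + extend_lens)
    kv_indices = np.empty(int(offsets[-1]), dtype=np.int64)
    for i, (table, plen) in enumerate(zip(block_tables, prefix_lens)):
        base = offsets[i]
        # Prefix: translate (page index, in-page offset) to flat cache slot.
        for t in range(plen):
            kv_indices[base + t] = table[t // page_size] * page_size + t % page_size
        # Extend: new tokens follow the prefix contiguously.
        for t in range(extend_lens[i]):
            kv_indices[base + plen + t] = extend_starts[i] + t
    return offsets, kv_indices

# Two requests: request 0 has a 3-token prefix in pages [2, 0] plus 2 new
# tokens at slot 100; request 1 has a 1-token prefix in page [5] plus 1 new
# token at slot 200 (page_size=2).
block_tables = [[2, 0], [5]]
offsets, kv_indices = build_unified_kv_indices(
    block_tables, np.array([3, 1]), [100, 200], np.array([2, 1]), page_size=2)
# offsets    -> [0, 5, 7]
# kv_indices -> [4, 5, 0, 100, 101, 10, 200]
```

In the real code the scatter step is a Triton kernel (_scatter_extend_kv_indices_kernel) so that nothing falls back to the CPU and the whole pipeline stays CUDA Graph capturable.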
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Vectorize causal mask in test references for ~26x speedup
Replace the triple Python for-loop with a paddle.where vectorized mask
in naive_attention and _build_causal_mask. The seq4096 test drops from
2m39s to 6s.
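The shape of that change, sketched in NumPy rather than Paddle (one broadcasted comparison replaces the per-element Python loops; the causal offset kv_len - q_len accounts for the prefix preceding the queries):

```python
import numpy as np

def naive_causal_mask(q_len, kv_len):
    # Loop-based reference: O(q_len * kv_len) Python-level iterations.
    mask = np.zeros((q_len, kv_len), dtype=bool)
    for i in range(q_len):
        for j in range(kv_len):
            # Query i may attend to kv position j iff j <= i + (kv_len - q_len).
            mask[i, j] = j <= i + (kv_len - q_len)
    return mask

def vectorized_causal_mask(q_len, kv_len):
    # Single broadcasted comparison; no Python loops.
    i = np.arange(q_len)[:, None]
    j = np.arange(kv_len)[None, :]
    return j <= i + (kv_len - q_len)

assert np.array_equal(naive_causal_mask(5, 9), vectorized_causal_mask(5, 9))
```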
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>