* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path
- Add Triton-based rms_norm_batch_invariant kernel so RMSNorm output does not depend on the batch dimension M
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route the TP VocabParallelEmbedding all-reduce through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
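
A minimal pure-Python sketch of the batch-invariance property the RMSNorm tests check, not the Triton kernel itself. The names rms_norm_row and rms_norm_batch are hypothetical, and the fixed left-to-right reduction is an assumed way to obtain the property:

```python
import math
import random


def rms_norm_row(x, weight, eps=1e-6):
    # Reduce in one fixed left-to-right order, so the rounding of the
    # sum of squares depends only on this row, never on the batch size M.
    sq_sum = 0.0
    for v in x:
        sq_sum += v * v
    inv_rms = 1.0 / math.sqrt(sq_sum / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rms_norm_batch(rows, weight, eps=1e-6):
    # Hypothetical batch entry point: every row goes through the same
    # per-row routine regardless of how many rows are present.
    return [rms_norm_row(r, weight, eps) for r in rows]


random.seed(0)
D = 64
weight = [random.uniform(0.5, 1.5) for _ in range(D)]
batch = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(8)]

alone = rms_norm_batch(batch[:1], weight)[0]   # row 0 with M = 1
in_batch = rms_norm_batch(batch, weight)[0]    # row 0 with M = 8

# Batch invariance: the same row yields bitwise-identical floats.
assert alone == in_batch
```

A GPU kernel can lose this property when its reduction split or tile size is chosen from M, which changes the order of floating-point additions; pinning the per-row reduction order is one way to avoid that.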
* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size
- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in the RMSNorm numerical
correctness test; a diff of 0.0078125 (2^-7, one bfloat16 ULP for values in [1, 2)) is expected at bfloat16 precision
- Update the expected buffer size in test_communication from 8MB to 64MB to match
the new FD_CUSTOM_AR_MAX_SIZE_MB default in envs.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
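
For context on the atol change above, a small sketch showing that 0.0078125 is the spacing between adjacent bfloat16 values in [1, 2). The helper to_bfloat16 is hypothetical and emulates round-to-nearest-even on the float32 bit pattern; it ignores NaN and infinity. That the test's outputs have magnitude around 1 is an assumption:

```python
import struct


def to_bfloat16(x):
    # Round to the nearest bfloat16 value (ties to even) by keeping the
    # top 16 bits of the float32 encoding. NaN/inf are not handled.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


one = to_bfloat16(1.0)
next_up = to_bfloat16(1.0 + 2.0 ** -7)
ulp = next_up - one

# One ULP near 1.0 already exceeds the old tolerance but fits the new one.
assert ulp == 0.0078125
assert 1e-3 < ulp <= 1e-2

# Anything finer than half an ULP is lost when rounding to bfloat16.
assert to_bfloat16(1.0 + 2.0 ** -9) == 1.0
```

Two results that differ by a single rounding step near 1.0 are therefore 0.0078125 apart, which fails atol=1e-3 and passes atol=1e-2.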
* Add RMSNorm layer batch_invariant_mode unit test for coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add pragma no cover for Triton kernel and multi-GPU embedding path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>