* support new mtp
* refactor(speculate_decoding and mtp): optimize MTP structure logic; update spec-branch status processing
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some notes
* fix unittest and optimize notes
* fix model status update in eos-branch
* fix(custom_ops): gate unsupported ops for sm70/sm75 build
* fix(custom_ops): gate deepgemm exports to sm75+ only
* [BugFix][OP] deduplicate CUDA sources to avoid moe_deepgemm multiple definition
* revert two custom_ops files to 352f922f9
* [BugFix] Cap nvcc -t threads to avoid compilation failures on high-core machines
On machines with many cores (e.g. 192), the nvcc -t flag was set to
os.cpu_count(), causing each nvcc process to spawn that many internal
threads. Combined with Paddle's ThreadPoolExecutor launching parallel
compilations (also sized by cpu_count), this led to ~28K+ threads,
resource exhaustion, and silent compilation failures. The linker then
could not find the missing .o files, but a second build succeeded because
already-compiled objects were cached.
Cap nvcc -t at 4 to keep total parallelism reasonable.
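The cap described above amounts to a one-line clamp in the build script. A minimal sketch (the function name and the exact cap location are illustrative, not the actual build-script code):

```python
import os

def nvcc_thread_count(cap=4):
    # Each nvcc process spawns up to `-t N` internal threads. Since
    # Paddle's ThreadPoolExecutor already launches ~cpu_count() nvcc
    # processes in parallel, an uncapped N of cpu_count() yields roughly
    # cpu_count()**2 threads on large machines. Clamp the per-process
    # value so total parallelism stays bounded.
    return min(cap, os.cpu_count() or 1)
```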
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- Add WRAPPER_CHECK_PTR for pointer validity checks
- Add WRAPPER_ASSERT_GT/GE/LE for parameter range validation
- Simplify wrapper function calls to direct return pattern
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path
- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
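As a reference for what "batch-invariant" means here, a plain-Python sketch (the real kernel is written in Triton; names and the eps value are illustrative): every reduction stays within one row, so a row's output does not depend on how many other rows share the batch.

```python
import math

def rms_norm_row(x, weight, eps=1e-6):
    # y_j = x_j / rms(x) * w_j, with rms reduced over this row only
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def rms_norm(batch, weight, eps=1e-6):
    # Batch-invariant: no reduction crosses rows, so varying the batch
    # size M leaves each individual row's result bit-identical.
    return [rms_norm_row(row, weight, eps) for row in batch]
```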
* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size
- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py
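The 0.0078125 figure matches bfloat16's precision: the format carries 7 explicit significand bits, so the spacing between representable values in [1, 2) is 2**-7, which is why a diff of exactly that size is expected rather than a bug:

```python
# bfloat16 stores an 8-bit significand (7 explicit bits), so the ulp
# for values in [1.0, 2.0) is 2**-7 = 0.0078125 -- differences of this
# size are inherent to the format, hence relaxing atol to 1e-2.
BF16_EXPLICIT_SIGNIFICAND_BITS = 7
ulp_near_one = 2.0 ** -BF16_EXPLICIT_SIGNIFICAND_BITS
```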
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add RMSNorm layer batch_invariant_mode unit test for coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add pragma no cover for Triton kernel and multi-GPU embedding path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unittest
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
* [XPU] Add return value checks for all XPU kernel launches
- Add -fxpu-launch-return compiler flag in CMakeLists.txt to enable
kernel launch return values
- Add KERNEL_ASSERT_SUCCESS(ctx, ret_xre) checks after every XPU
kernel launch across 45 wrapper files (55 launch sites total)
- Covers both main wrapper/ and mtp_wrapper/ directories
- Properly handles multiple kernel launches in the same function
scope by reusing the ret_xre variable
* [XPU] code style fix
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.
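The caller-side fix reduces to unpacking four values and discarding the inplace-updated stats tensor. A sketch with a stub standing in for the real CUDA op (shapes and values are illustrative):

```python
def noaux_tc_redundant_stub(gating, tokens_per_expert_stats_list):
    # Stand-in for the real custom op: returns 4 outputs, the last being
    # the very same inplace-updated stats tensor that was passed in.
    scores = [g * 0.5 for g in gating]
    best = max(range(len(scores)), key=lambda i: scores[i])
    topk_values = [scores[best]]
    topk_indices = [best]
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list

stats = [0, 0]
# Unpack all 4 outputs; the 4th is identical to `stats`, so ignore it.
scores, topk_values, topk_indices, same_stats = noaux_tc_redundant_stub(
    [0.2, 0.8], stats
)
```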
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
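On the op side, the wrapper's return list simply grows to include both inplace-modified tensors, matching the 4 declared outputs. An illustrative pure-Python sketch (the real op runs on CUDA and uses atomicAdd; names and the top-k logic here are assumptions):

```python
def noaux_tc_redundant(scores, tokens_per_expert_stats_list, k=1):
    # Compute top-k over the (inplace-modified) scores, bump the
    # per-expert stats (mimicking the kernel's atomicAdd), and return
    # all 4 declared outputs.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    topk_indices = order[:k]
    topk_values = [scores[i] for i in topk_indices]
    for i in topk_indices:
        tokens_per_expert_stats_list[i] += 1
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list
```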
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>