Commit Graph

4862 Commits

Author SHA1 Message Date
SunLei 32b6900d01 fix code type (#6951) 2026-03-20 16:14:12 +08:00
AIbin bf7e2424d0 [Optimization][Feature]Supports multiple batches of DSK-DSA. (#6930)
* support DSA_MUTI_BATCH

* update test topk

* update dsk-dsa
2026-03-20 15:59:22 +08:00
周周周 1c38da2118 Make seq_lens_this_time/decoder/encoder equal shape (#6942) 2026-03-20 15:31:52 +08:00
Zhang Yulong 2b10ebc1f1 [benchmark] Refactor debug logging and payload handling (#6949)
* Refactor debug logging and payload handling

* Update backend_request_func.py
2026-03-20 15:04:10 +08:00
Zhang Yulong 3a4e139f65 [Benchmark] fix multi turn (#6948) 2026-03-20 13:22:30 +08:00
cloudforge1 aca733b95c [CI]【Hackathon 10th Spring No.32】load_weight_utils unit test (#6740)
* 【Hackathon 10th Spring No.32】Unit test for load_weight_utils.py

* [CI]【Hackathon 10th Spring No.32】rewrite load_weight_utils unit test

* [CI]【Hackathon 10th Spring No.32】improve load_weight_utils coverage to 83%

- Add test_load_ep_checkpoint_basic: exercises EP checkpoint loading with minimal fixture
- Add test_composite_ep_branch: covers EP path in load_composite_checkpoint
- Add test_get_weight_iterator_unordered: covers unordered sharded safetensors path

* [CI]【Hackathon 10th Spring No.32】align load_weight_utils test with gold standard (tmp_path, split tests)

* [CI]【Hackathon 10th Spring No.32】add coverage tests for load_weight_utils

- Add test_is_layers_grouped: test layers_are_grouped() with grouped, interleaved, and no-layer keys
- Add test_save_model_bf16_cache: exercise save_model decorator with is_checkpoint_bf16=True
- Add test_composite_checkpoint_ep: test load_composite_checkpoint use_ep=True branch
- Add test_composite_checkpoint_rank_mismatch: test tp_size != rank_dirs ValueError
- Add test_composite_checkpoint_kv_quant: test float8_e4m3fn kv_cache path
- Add __main__ block for direct execution

* [CI]【Hackathon 10th Spring No.32】raise load_weight_utils test delta

* [CI]【Hackathon 10th Spring No.32】cover TP sequence-parallel MoE load branches

* test: add load_reordered_experts, pre-sharded, and empty-state tests


---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-03-20 13:14:30 +08:00
xjkmfa 3b203994e2 [Benchmark] Update Qwen3 vl 32k yaml (#6946) 2026-03-20 11:48:53 +08:00
xjkmfa a81116ad90 [Benchmark] Update Qwen3 vl dense yaml (#6945) 2026-03-20 11:26:47 +08:00
sunxin d77edf8fc9 opt wfp8afp8 triton moe (#6938) 2026-03-20 11:07:25 +08:00
mouxin 96b0ecea6b [Feature] Update Counter Release (#6943) 2026-03-20 10:51:37 +08:00
luukunn f4a79d4c00 [Optimization]Unified data processing for online and offline (#6891)
* remove process_request

* fix chat

* fix unit test

* remove process response

* fix unit test

* fix offline decode

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix sampling_params

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-19 21:56:09 +08:00
luukunn c3d8db85c4 [Optimization] Update ZMQ server (#6735)
* add batch zmq send response

* update

* Revert "update"

This reverts commit 0234a25b47.

* update

* remove lock

* fix unit test

* add unit test

* add unit test

* pre commit

* add unit test

* fix unit test

* add unit test

* fix worker>1

* update zmq_worker_pid

* fix unit test

* fix unit test

* fix unit test

* add unit test

* fix unit test

* fix first token time

* fix logprobs

* add unit test

* op

* remove debug log

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-03-19 21:53:16 +08:00
cloudforge1 9148562ed0 [CI]【Hackathon 10th Spring No.35】resource_manager unit test additions (#6734)
* [CI]【Hackathon 10th Spring No.35】resource_manager unit test additions

* [CI]【Hackathon 10th Spring No.35】resource_manager unit test additions

* [CI]【Hackathon 10th Spring No.35】add __main__ block

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-19 17:45:21 +08:00
YuBaoku 7141db0e01 [CI] Optimize CI: update nightly test_image build workflow (#6937) 2026-03-19 17:39:01 +08:00
周周周 b1c800b64b remove load_up_proj_weight_first (#6932) 2026-03-19 17:21:34 +08:00
sunxin 33e01f22a8 [Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 (#6894)
* update sampling

* fix

* fix

* fix mtp

* fix test
2026-03-19 01:43:10 -07:00
YuBaoku 2b84a4276e [CI] Optimize CI: add timeout and cancel on PR close (#6933) 2026-03-19 15:54:30 +08:00
JYChen f95d8ca7df [RL] support qkrmsnorm use proxy-norm (#6862)
* support qkrmsnorm use paddle.nn.functional.rms_norm

* remove flags in fd
2026-03-18 23:27:26 -07:00
周周周 1a05744c4e nvfp4.py support ep (#6920) 2026-03-19 14:07:46 +08:00
周周周 c184a7cb69 remove source in weight_loader in moe.py (#6892) 2026-03-19 13:31:43 +08:00
Nyakku Shigure dd93f8ffb4 [Optimization] Skip compat guard when torch is not installed (#6913) 2026-03-19 11:29:27 +08:00
AIbin 4794a28f3d opt glm5 model (#6916) 2026-03-19 11:13:33 +08:00
jc dd55cda3c8 [CI] Add test for pd and cache storage (#6876)
* Add test for pd and cache storage

* up

* up

* fix bug

* fix bug

* up docker image

* up
2026-03-19 10:38:27 +08:00
gongweibao fb6c56dfd5 [BugFix][DataProcessor] Force top_k=1 for greedy decoding when temperature=0 (#6748)
* [BugFix] Force top_k=1 for greedy decoding when temperature=0

When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.

Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.

* Extract greedy decoding from FD_DETERMINISTIC_MODE guard

top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update test_torch_model.py

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-18 17:36:43 +08:00
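The fix in #6748 above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `normalize_sampling_params` and the epsilon value are hypothetical, but the logic mirrors the commit message — when temperature is 0, pinning top_k=1 (rather than relying on a tiny temperature alone) guarantees argmax behavior even on backends whose kernels would otherwise sample a non-top-1 token.

```python
# Hypothetical sketch of the greedy-decoding fix described above.
# When temperature == 0, force top_k=1 so every backend degenerates
# to argmax, regardless of how its sampling kernel treats small
# temperatures.
EPSILON = 1e-6  # assumed placeholder; the real value lives in the processors

def normalize_sampling_params(temperature: float, top_k: int) -> tuple[float, int]:
    """Return (temperature, top_k) adjusted for greedy decoding.

    Scaling logits by a tiny epsilon alone is insufficient: the sampling
    kernel may still pick a non-top-1 token. Explicitly setting top_k=1
    guarantees argmax behavior.
    """
    if temperature == 0:
        return EPSILON, 1  # greedy: argmax via top_k=1
    return temperature, top_k
```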
AIbin 9b117aafac support glm-moe-dsa model (#6863) 2026-03-18 17:21:55 +08:00
YuBaoku 07543685ec [CI] Isolate cache and ccache for CUDA 13.0 build 2026-03-18 11:41:46 +08:00
fxyfxy777 9660f98837 [BugFix] Set FD_USE_PHI_MOE_PERMUTE = 0 Default (#6886)
* FD_USE_PHI_MOE_PERMUTE = 0

* modify comments
2026-03-17 20:05:39 -07:00
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
YuBaoku 0359794e08 [CI] Sync _log_softmax_batch_invariant with paddle update (#6893) 2026-03-17 23:03:57 +08:00
mouxin 2a371a3450 [Feature] Update tpSize (#6896) 2026-03-17 20:20:39 +08:00
lizan1999 148eee84c6 [XPU] use quant2d_per_token for weight quant int8 && fix some XPU Kernel check (#6869) 2026-03-17 19:44:48 +08:00
Jiaxin Sui aa9deb6ad4 [XPU] Dockerfiles update (#6898)
* Update Dockerfile.xpu

* Add build script for XPU Docker image

* Refactor Dockerfile to conditionally install packages

Added conditional installation for requirements and fastdeploy.

* Reorder RUN commands in Dockerfile.xpu

* Update Dockerfile.xpu

* Delete dockerfiles/build_xpu.sh
2026-03-17 19:43:49 +08:00
gongweibao e4c9cac124 [BugFix] Cap nvcc -t threads to avoid compilation failures on high-co… (#6885)
* [BugFix] Cap nvcc -t threads to avoid compilation failures on high-core machines

On machines with many cores (e.g. 192), the nvcc -t flag was set to
os.cpu_count(), causing each nvcc process to spawn that many internal
threads. Combined with Paddle's ThreadPoolExecutor launching parallel
compilations (also based on cpu_count), this leads to ~28K+ threads,
resource exhaustion, and silent compilation failures. The linker then
cannot find the missing .o files, but a second build succeeds because
already-compiled objects are cached.

Cap nvcc -t at 4 to keep total parallelism reasonable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-17 19:27:45 +08:00
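The cap described in #6885 above amounts to one line. This is a hedged sketch, not the build script's real code: the helper name `nvcc_thread_flag` and the cap constant are illustrative, but the idea matches the commit — bound the `-t` value passed to nvcc so that nvcc's internal threads multiplied by Paddle's parallel compilation jobs stay reasonable on high-core machines.

```python
import os

# Hypothetical sketch of the cap described above: bound the nvcc -t
# (threads) flag instead of passing raw os.cpu_count(), so total
# parallelism stays sane when many nvcc processes run at once.
NVCC_THREAD_CAP = 4  # value from the commit message

def nvcc_thread_flag() -> list[str]:
    """Build the nvcc -t argument, capped at NVCC_THREAD_CAP threads."""
    threads = min(os.cpu_count() or 1, NVCC_THREAD_CAP)
    return ["-t", str(threads)]
```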
AIbin cb6819d086 [Optimization][OP]support per_token_group_fp8_quant cuda kernel (#6865)
* support per_token_group_fp8_quant cuda kernel

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* update code

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-17 19:17:51 +08:00
mouxin b61731bb96 [Feature][Docs] Adjust prefill release & expose load metrics (#6884) 2026-03-17 15:23:13 +08:00
Longzhi Wang daaf498213 [Feature] support compute shared experts before combine for better overlap (#6697)
* [Feature] support compute shared experts before combine for better overlap

* fix test

* fix xpu

* fix
2026-03-17 15:18:51 +08:00
Jiang-Jia-Jun 12eb001d0c Remove comments on multi-mode request handling
Removed comments about multi-mode scenarios and request pulling.
2026-03-17 14:49:00 +08:00
jc 950366e58d [PD Disaggregation][RL] Register to router with version and support rdma eager connect for pd (#6718)
* [Feature] Register to router with version info for PD disaggregation

Add RegisterManager for PD (Prefill-Decode) disaggregated deployment:
- All instances (Prefill/Decode) register to Router with heartbeat
- Prefill instances fetch Decode instance list from Router
- Prefill instances establish eager RDMA connections to Decode instances
- Register info includes: host_ip, port, role, version, is_paused, connected_decodes

Changes:
- Add RegisterManager class for managing PD registration and RDMA connections
- Add version field to ModelConfig for model version tracking
- Add connected_decodes to register_info for tracking connected Decode instances
- Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable

Test fixes:
- Add None checks for load_config in FDConfig.__init__
- Add version attribute to test mock model configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refine

* remove test

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:43:35 +08:00
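The register record listed in #6718 above can be sketched as a plain dict. This is an assumption-laden illustration, not the actual `RegisterManager` API: the builder function is hypothetical, but the fields are exactly those the commit message names as the info each Prefill/Decode instance reports to the Router on heartbeat.

```python
# Hypothetical sketch of the register payload described above: each
# Prefill/Decode instance sends this record to the Router with its
# heartbeat. Field names are taken verbatim from the commit message.
def build_register_info(host_ip, port, role, version, is_paused, connected_decodes):
    return {
        "host_ip": host_ip,
        "port": port,
        "role": role,            # "prefill" or "decode"
        "version": version,      # model version, tracked via ModelConfig
        "is_paused": is_paused,
        # Decode instances this Prefill has eager RDMA connections to
        "connected_decodes": list(connected_decodes),
    }
```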
YuBaoku b152baeeee [CI] disable test_batch_invariance_op_logsoftmax.py in unit_test 2026-03-17 14:43:14 +08:00
周周周 ea998dd26f clean clean code in _load_per_tensor_weight_scale (#6868)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-03-17 14:06:57 +08:00
qwes5s5 3b7507a4c2 test_abort (#6743) 2026-03-17 14:06:40 +08:00
huicongyao eab429d05e fix performance drop while no spec (#6866) 2026-03-17 13:06:36 +08:00
luukunn fe8d58a094 [Optimization]update request in tool parser&reasoning parser (#6858)
* update request in tool parser&reasoning parser
2026-03-17 11:51:12 +08:00
RichardWooSJTU 4ed483d20b [BugFix] Fix ep compatibility issues & Optimize permute operator (#6821)
* fix ep compatibility issues & optimize permute operator

* fix ut

* fix ut
2026-03-17 10:32:11 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
Jiang-Jia-Jun d113397b09 Simplify available_blocks assignment logic (#6819) 2026-03-16 20:12:30 +08:00
Longzhi Wang 5c92f4d0cd [Feature] Add deepgemm bias epilogue for SM100 (#6857)
* [Feature] Add deepgemm bias epilogue for SM100

* fix
2026-03-16 20:12:00 +08:00
Jiang-Jia-Jun bd4b6092dd Update title and activity section in README_CN.md 2026-03-16 19:21:50 +08:00
Jiang-Jia-Jun c5f402e7aa Update title and release note in README_CN.md 2026-03-16 19:17:38 +08:00
AIbin c9f7f5234e [Optimization][BugFix]Optimize Deepseek networking code (#6861)
* update dsk model

* update dsk model
2026-03-16 16:52:43 +08:00