5 Commits

Author SHA1 Message Date
jc 950366e58d [PD Disaggregation][RL] Register to router with version and support rdma eager connect for pd (#6718)
* [Feature] Register to router with version info for PD disaggregation

Add RegisterManager for PD (Prefill-Decode) disaggregated deployment:
- All instances (Prefill/Decode) register to Router with heartbeat
- Prefill instances fetch Decode instance list from Router
- Prefill instances establish eager RDMA connections to Decode instances
- Register info includes: host_ip, port, role, version, is_paused, connected_decodes

Changes:
- Add RegisterManager class for managing PD registration and RDMA connections
- Add version field to ModelConfig for model version tracking
- Add connected_decodes to register_info for tracking connected Decode instances
- Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable
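
The registration flow described above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the field names and the FD_ENABLE_PD_RDMA_EAGER_CONNECT variable come from the message, while `RegisterInfo`, `build_payload`, `eager_connect_enabled`, `pending_decodes`, the "host:port" address format, and the convention that "1" means enabled are all assumptions made for the example.

```python
import os
from dataclasses import asdict, dataclass, field
from typing import List, Mapping, Optional


@dataclass
class RegisterInfo:
    """Hypothetical container for the register info fields named in the commit message."""

    host_ip: str
    port: int
    role: str  # assumed values: "prefill" or "decode"
    version: str
    is_paused: bool = False
    # Decode instances this Prefill instance is already connected to
    # (assumed here to be "host:port" strings).
    connected_decodes: List[str] = field(default_factory=list)


def build_payload(info: RegisterInfo) -> dict:
    """Turn the register info into a plain dict, e.g. for a heartbeat body."""
    return asdict(info)


def eager_connect_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Check FD_ENABLE_PD_RDMA_EAGER_CONNECT; treating "1" as enabled is an assumption."""
    env = os.environ if env is None else env
    return env.get("FD_ENABLE_PD_RDMA_EAGER_CONNECT", "0") == "1"


def pending_decodes(info: RegisterInfo, decode_list: List[str]) -> List[str]:
    """Return Decode instances from the Router's list that are not yet connected."""
    already = set(info.connected_decodes)
    return [addr for addr in decode_list if addr not in already]
```

Under these assumptions, a Prefill instance would call `pending_decodes` with the list fetched from the Router, connect to each returned address when `eager_connect_enabled()` is true, and append the address to `connected_decodes` so that the next heartbeat payload reports it.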

Test fixes:
- Add None checks for load_config in FDConfig.__init__
- Add version attribute to test mock model configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refine

* remove test

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:43:35 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
Jingfeng Wu 7d44009f39 [FDConfig] transfer metrics_port (#6056)
* transfer metrics_port

* transfer metrics_port
2026-01-19 19:58:57 +08:00
Juncai 0925d44f18 [PD Disaggregation] support different tp_size for prefill and decode (#5296)
* up

* up

* up

* fix
2025-12-01 17:50:20 +08:00
Juncai 08ca0f6aea [Feature] [PD] add simple router and refine splitwise deployment (#4709)
* add simple router and refine splitwise deployment

* fix
2025-11-06 14:56:02 +08:00