[PD Disaggregation][RL] Register to router with version info and support eager RDMA connections for PD (#6718)

* [Feature] Register to router with version info for PD disaggregation

Add RegisterManager for PD (Prefill-Decode) disaggregated deployment:
- All instances (Prefill/Decode) register to Router with heartbeat
- Prefill instances fetch Decode instance list from Router
- Prefill instances establish eager RDMA connections to Decode instances
- Register info includes: host_ip, port, role, version, is_paused, connected_decodes
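
The registration flow above can be sketched roughly as follows. This is an illustration only, not the code added by this commit: the `RegisterInfo` dataclass, the `eager_connect` helper, the `connect` callback, the `"prefill"` role string, and the `host:port` address format are all assumptions made for the example. Only the field names come from the list above.

```python
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class RegisterInfo:
    """Hypothetical container for the register info fields listed above."""

    host_ip: str
    port: int
    role: str  # assumed values: "prefill" or "decode"
    version: str
    is_paused: bool = False
    connected_decodes: List[str] = field(default_factory=list)

    def to_payload(self) -> Dict[str, object]:
        # Plain dict that could be sent to the Router with each heartbeat.
        return asdict(self)


def eager_connect(
    info: RegisterInfo,
    decode_addrs: List[str],
    connect: Callable[[str], bool],
) -> List[str]:
    """Connect a Prefill instance to every Decode it is not yet connected to.

    `connect` is a stand-in for the real RDMA connection call; it returns
    True on success. Returns the addresses that were newly connected.
    """
    if info.role != "prefill":
        return []  # only Prefill instances initiate connections
    newly_connected = []
    for addr in decode_addrs:
        if addr in info.connected_decodes:
            continue  # already connected, nothing to do
        if connect(addr):
            info.connected_decodes.append(addr)
            newly_connected.append(addr)
    return newly_connected
```

Tracking `connected_decodes` on the register info lets the next heartbeat report which Decode instances are already reachable, so a failed connection is retried on the next pass instead of being lost.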

Changes:
- Add RegisterManager class for managing PD registration and RDMA connections
- Add version field to ModelConfig for model version tracking
- Add connected_decodes to register_info for tracking connected Decode instances
- Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable
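
As a rough sketch of how a feature flag such as `FD_ENABLE_PD_RDMA_EAGER_CONNECT` might gate the eager connection step: the helper name, the default of "disabled", and the accepted truthy strings below are assumptions for illustration. The commit message does not state how the variable is parsed.

```python
import os
from typing import Mapping, Optional


def rdma_eager_connect_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the flag is set to a truthy value.

    Hypothetical helper: the parsing rules here are an assumption,
    not necessarily what the project does.
    """
    env = os.environ if environ is None else environ
    raw = env.get("FD_ENABLE_PD_RDMA_EAGER_CONNECT", "0")
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

Passing the environment as an argument keeps the check easy to test without mutating `os.environ`.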

Test fixes:
- Add None checks for load_config in FDConfig.__init__
- Add version attribute to test mock model configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refine

* remove test

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
jc
2026-03-17 14:43:35 +08:00
committed by GitHub
parent b152baeeee
commit 950366e58d
14 changed files with 507 additions and 97 deletions
@@ -1163,6 +1163,7 @@ trace_logger = FastDeployLogger().get_trace_logger("trace", "trace.log")
router_logger = get_logger("router", "router.log")
fmq_logger = get_logger("fmq", "fmq.log")
obj_logger = get_logger("obj", "obj.log") # debug memory issues
register_manager_logger = get_logger("register_manager", "register_manager.log")
def parse_type(return_type: Callable[[str], T]) -> Callable[[str], T]: