[BUG] POC prepare doesn't use overseer_agent sp_end_point ports from input project.yml
Describe the bug
When running nvflare poc prepare -i project.yml, the builders.args.overseer_agent.args.sp_end_point value for a DummyOverseerAgent is not reflected in the provisioned fed_server.json, fed_client.json, fed_admin.json files. This means that even if you change the admin and fed_learn ports in the project yaml and endpoint, the POC processes still try connecting to the default 8003/8002 ports.
To Reproduce
Steps to reproduce the behavior:
- Create a project.yml based off of the default POC config, but change the admin and fed_learn ports to 8005 and 8004:
    - sp_end_point: server:8003:8002
    + sp_end_point: server:8005:8004
    - admin_port: 8003
    + admin_port: 8005
    - fed_learn_port: 8002
    + fed_learn_port: 8004
    api_version: 3
    builders:
    - args:
        template_file:
        - master_template.yml
        - aws_template.yml
        - azure_template.yml
      path: nvflare.lighter.impl.workspace.WorkspaceBuilder
    - path: nvflare.lighter.impl.template.TemplateBuilder
    - args:
        config_folder: config
        overseer_agent:
          args:
            sp_end_point: server:8004:8005
          overseer_exists: false
          path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent
      path: nvflare.lighter.impl.static_file.StaticFileBuilder
    - path: nvflare.lighter.impl.cert.CertBuilder
    - path: nvflare.lighter.impl.signature.SignatureBuilder
    description: NVIDIA FLARE sample project yaml file
    name: example_project
    participants:
    - admin_port: 8005
      fed_learn_port: 8004
      name: server
      org: nvidia
      type: server
    - name: admin@nvidia.com
      org: nvidia
      role: project_admin
      type: admin
    - name: site-1
      org: nvidia
      type: client
    - name: site-2
      org: nvidia
      type: client
- Run nvflare poc prepare -i project.yml with that file
- Go to the provisioned file poc/example_project/prod_00/server/startup/fed_server.json and notice that the target and admin ports are properly set to 8004 and 8005, but the overseer_agent args still use sp_end_point: "localhost:8002:8003".
    {
      "format_version": 2,
      "servers": [
        {
          "name": "example_project",
          "service": {
            "target": "localhost:8004",
            "scheme": "grpc"
          },
          "admin_host": "localhost",
          "admin_port": 8005,
          "ssl_private_key": "server.key",
          "ssl_cert": "server.crt",
          "ssl_root_cert": "rootCA.pem"
        }
      ],
      "overseer_agent": {
        "args": {
          "sp_end_point": "localhost:8002:8003"
        },
        "path": "nvflare.ha.dummy_overseer_agent.DummyOverseerAgent"
      }
    }
- The same overseer_agent sp_end_point can be seen in admin/startup/fed_admin.json or site/startup/fed_client.json
- If you continue to launch the POC (nvflare poc start), the participants will try and fail to connect over the old 8002/8003 ports. This leads to a login error and the following logs:
# nvflare poc start
WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/server/startup/..
PYTHONPATH is /local/custom:
WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/site-1/startup/..
PYTHONPATH is /local/custom:
WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/site-2/startup/..
PYTHONPATH is /local/custom:
start fl because of no pid.fl
new pid 34115
Trying to obtain server address
Obtained server address: localhost:8003
Trying to login, please wait ...
start fl because of no pid.fl
new pid 34133
2024-08-09 11:34:46,011 - nvflare.private.fed.app.deployer.server_deployer.ServerDeployer - INFO - server heartbeat timeout set to 600
2024-08-09 11:34:46,155 - CoreCell - INFO - server: creating listener on grpc://0:8004
2024-08-09 11:34:46,186 - CoreCell - INFO - server: created backbone external listener for grpc://0:8004
2024-08-09 11:34:46,187 - ConnectorManager - INFO - 34115: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-08-09 11:34:46,188 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:11825] is starting
start fl because of no pid.fl
new pid 34142
Trying to login, please wait ...
Waiting for SP....
2024-08-09 11:34:46,693 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:11825
2024-08-09 11:34:46,693 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE grpc://0:8004] is starting
2024-08-09 11:34:46,694 - nvflare.private.fed.app.deployer.server_deployer.ServerDeployer - INFO - deployed FLARE Server.
2024-08-09 11:34:46,706 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 8005
2024-08-09 11:34:46,706 - root - INFO - Server started
2024-08-09 11:34:46,709 - nvflare.fuel.f3.drivers.grpc_driver.Server - INFO - added secure port at 0.0.0.0:8004
2024-08-09 11:34:46,909 - CoreCell - INFO - site-1: created backbone external connector to grpc://localhost:8002
2024-08-09 11:34:46,909 - ConnectorManager - INFO - 34133: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-08-09 11:34:46,912 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:25585] is starting
2024-08-09 11:34:47,415 - CoreCell - INFO - site-1: created backbone internal listener for tcp://localhost:25585
2024-08-09 11:34:47,416 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE grpc://localhost:8002] is starting
2024-08-09 11:34:47,416 - FederatedClient - INFO - Wait for engine to be created.
2024-08-09 11:34:47,424 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at localhost:8002
2024-08-09 11:34:47,424 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 N/A => localhost:8002] is created: PID: 34133
2024-08-09 11:34:47,434 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 34133
2024-08-09 11:34:47,434 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00002 Not Connected]
Waiting for SP....
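The mismatch from the reproduction steps can also be checked mechanically instead of by reading the provisioned files. The sketch below is not part of NVFlare; the function name sp_end_point_mismatches is hypothetical, and it assumes the sp_end_point layout is host:fed_learn_port:admin_port, which is inferred from the project.yml and fed_server.json shown above. The sample dictionary is a trimmed copy of the provisioned fed_server.json from this report.

```python
def sp_end_point_mismatches(fed_server: dict) -> list:
    """Compare the server's own ports with the overseer_agent sp_end_point.

    Returns a list of human-readable problems; an empty list means the
    endpoint agrees with the server section.
    """
    server = fed_server["servers"][0]
    # "target" looks like "localhost:8004"; the port follows the last colon.
    fl_port = int(server["service"]["target"].rsplit(":", 1)[1])
    admin_port = int(server["admin_port"])

    # ASSUMPTION: layout is host:fed_learn_port:admin_port (inferred from this report).
    end_point = fed_server["overseer_agent"]["args"]["sp_end_point"]
    _host, sp_fl_port, sp_admin_port = end_point.split(":")

    problems = []
    if int(sp_fl_port) != fl_port:
        problems.append(
            f"fed_learn port: sp_end_point has {sp_fl_port}, server target has {fl_port}"
        )
    if int(sp_admin_port) != admin_port:
        problems.append(
            f"admin port: sp_end_point has {sp_admin_port}, server admin_port is {admin_port}"
        )
    return problems


# Trimmed copy of the provisioned fed_server.json from this report.
sample = {
    "servers": [
        {
            "name": "example_project",
            "service": {"target": "localhost:8004", "scheme": "grpc"},
            "admin_host": "localhost",
            "admin_port": 8005,
        }
    ],
    "overseer_agent": {
        "args": {"sp_end_point": "localhost:8002:8003"},
        "path": "nvflare.ha.dummy_overseer_agent.DummyOverseerAgent",
    },
}

for problem in sp_end_point_mismatches(sample):
    print(problem)
```

To run it against a real workspace, load poc/example_project/prod_00/server/startup/fed_server.json with json.load and pass the result to the function. The client and admin files are laid out differently, so this sketch only covers the server file.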
Expected behavior
I would like to be able to change the POC overseer ports so that developers can run multiple POCs on the same machine, each using non-conflicting ports defined in its own project yaml.
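As an illustration of the expected result, the sketch below derives the endpoint I would expect to see in the provisioned files from the server participant in the project yaml. It is not how NVFlare builds the value; expected_sp_end_point is a hypothetical helper, the host:fed_learn_port:admin_port order is inferred from the files in this report, and the "localhost" default mirrors what POC wrote into fed_server.json.

```python
def expected_sp_end_point(project: dict, host: str = "localhost") -> str:
    """Build the sp_end_point implied by the server participant's ports.

    ASSUMPTION: layout is host:fed_learn_port:admin_port, and POC replaces
    the server name with "localhost" (both inferred from this report).
    """
    server = next(p for p in project["participants"] if p["type"] == "server")
    return f"{host}:{server['fed_learn_port']}:{server['admin_port']}"


# Participants section of the project.yml from this report, already parsed.
project = {
    "participants": [
        {"admin_port": 8005, "fed_learn_port": 8004, "name": "server",
         "org": "nvidia", "type": "server"},
        {"name": "admin@nvidia.com", "org": "nvidia",
         "role": "project_admin", "type": "admin"},
        {"name": "site-1", "org": "nvidia", "type": "client"},
        {"name": "site-2", "org": "nvidia", "type": "client"},
    ]
}

print(expected_sp_end_point(project))
```

With the ports from this report, the value differs from the "localhost:8002:8003" that ended up in the provisioned files.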
Screenshots
See files/logs pasted above.
Desktop (please complete the following information):
- OS: MacOS Sonoma, Ubuntu 20.04
- Python Version 3.10.13
- NVFlare Version 2.4.1, 2.5.0rc1+29.ga276dc03
Additional context
N/A
good catch @parkeraddison, I will fix this.