NVIDIA/NVFlare

[BUG] POC prepare doesn't use overseer_agent sp_end_point ports from input project.yml

Closed · 1 comment

Describe the bug
When running nvflare poc prepare -i project.yml, the builders.args.overseer_agent.args.sp_end_point value for a DummyOverseerAgent is not reflected in the provisioned fed_server.json, fed_client.json, and fed_admin.json files. As a result, even if you change the admin and fed_learn ports in both the participant entries and the sp_end_point of the project yaml, the POC processes still try to connect to the default 8002/8003 ports.
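
For example, the mismatch is easy to spot by loading the generated server config directly. This is only a minimal sketch using the standard library; the path below is the default POC output used in the reproduce steps, so adjust it to wherever nvflare poc prepare wrote the workspace:

import json
from pathlib import Path

# Provisioned server config under the default POC workspace used in this report.
fed_server = Path("poc/example_project/prod_00/server/startup/fed_server.json")
conf = json.loads(fed_server.read_text())

print("target:      ", conf["servers"][0]["service"]["target"])         # localhost:8004
print("admin_port:  ", conf["servers"][0]["admin_port"])                # 8005
print("sp_end_point:", conf["overseer_agent"]["args"]["sp_end_point"])  # localhost:8002:8003 (stale default)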

To Reproduce
Steps to reproduce the behavior:

  1. Create a project.yml based on the default POC config, changing the admin and fed_learn ports to 8005 and 8004:
    - sp_end_point: server:8002:8003
    + sp_end_point: server:8004:8005
    - admin_port: 8003
    + admin_port: 8005
    - fed_learn_port: 8002
    + fed_learn_port: 8004
    api_version: 3
    builders:
    - args:
        template_file:
        - master_template.yml
        - aws_template.yml
        - azure_template.yml
      path: nvflare.lighter.impl.workspace.WorkspaceBuilder
    - path: nvflare.lighter.impl.template.TemplateBuilder
    - args:
        config_folder: config
        overseer_agent:
          args:
            sp_end_point: server:8004:8005
          overseer_exists: false
          path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent
      path: nvflare.lighter.impl.static_file.StaticFileBuilder
    - path: nvflare.lighter.impl.cert.CertBuilder
    - path: nvflare.lighter.impl.signature.SignatureBuilder
    description: NVIDIA FLARE sample project yaml file
    name: example_project
    participants:
    - admin_port: 8005
      fed_learn_port: 8004
      name: server
      org: nvidia
      type: server
    - name: admin@nvidia.com
      org: nvidia
      role: project_admin
      type: admin
    - name: site-1
      org: nvidia
      type: client
    - name: site-2
      org: nvidia
      type: client
  2. Run nvflare poc prepare -i project.yml with that file
  3. Go to the provisioned file poc/example_project/prod_00/server/startup/fed_server.json and notice that the target and admin ports are properly set to 8004 and 8005, but the overseer_agent args still use sp_end_point: "localhost:8002:8003".
    {
      "format_version": 2,
      "servers": [
        {
          "name": "example_project",
          "service": {
            "target": "localhost:8004",
            "scheme": "grpc"
          },
          "admin_host": "localhost",
          "admin_port": 8005,
          "ssl_private_key": "server.key",
          "ssl_cert": "server.crt",
          "ssl_root_cert": "rootCA.pem"
        }
      ],
      "overseer_agent": {
        "args": {
          "sp_end_point": "localhost:8002:8003"
        },
        "path": "nvflare.ha.dummy_overseer_agent.DummyOverseerAgent"
      }
    }
  4. The same stale overseer_agent sp_end_point appears in the admin's startup/fed_admin.json and in each site's startup/fed_client.json.
  5. If you then launch the POC with nvflare poc start, the participants try to connect over the old 8002/8003 ports and fail. This leads to a login error and the following logs (see the short sketch after the log output for how the endpoint string maps to these ports):
# nvflare poc start
WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/server/startup/..
PYTHONPATH is /local/custom:
WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/site-1/startup/..
PYTHONPATH is /local/custom:
WORKSPACE set to /Users/paddison/repos/FedRAG/outputs/poc/example_project/prod_00/site-2/startup/..
PYTHONPATH is /local/custom:
start fl because of no pid.fl
new pid 34115
Trying to obtain server address
Obtained server address: localhost:8003
Trying to login, please wait ...
start fl because of no pid.fl
new pid 34133
2024-08-09 11:34:46,011 - nvflare.private.fed.app.deployer.server_deployer.ServerDeployer - INFO - server heartbeat timeout set to 600
2024-08-09 11:34:46,155 - CoreCell - INFO - server: creating listener on grpc://0:8004
2024-08-09 11:34:46,186 - CoreCell - INFO - server: created backbone external listener for grpc://0:8004
2024-08-09 11:34:46,187 - ConnectorManager - INFO - 34115: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-08-09 11:34:46,188 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:11825] is starting
start fl because of no pid.fl
new pid 34142
Trying to login, please wait ...
Waiting for SP....
2024-08-09 11:34:46,693 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:11825
2024-08-09 11:34:46,693 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE grpc://0:8004] is starting
2024-08-09 11:34:46,694 - nvflare.private.fed.app.deployer.server_deployer.ServerDeployer - INFO - deployed FLARE Server.
2024-08-09 11:34:46,706 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 8005
2024-08-09 11:34:46,706 - root - INFO - Server started
2024-08-09 11:34:46,709 - nvflare.fuel.f3.drivers.grpc_driver.Server - INFO - added secure port at 0.0.0.0:8004
2024-08-09 11:34:46,909 - CoreCell - INFO - site-1: created backbone external connector to grpc://localhost:8002
2024-08-09 11:34:46,909 - ConnectorManager - INFO - 34133: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2024-08-09 11:34:46,912 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:25585] is starting
2024-08-09 11:34:47,415 - CoreCell - INFO - site-1: created backbone internal listener for tcp://localhost:25585
2024-08-09 11:34:47,416 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE grpc://localhost:8002] is starting
2024-08-09 11:34:47,416 - FederatedClient - INFO - Wait for engine to be created.
2024-08-09 11:34:47,424 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at localhost:8002
2024-08-09 11:34:47,424 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 N/A => localhost:8002] is created: PID: 34133
2024-08-09 11:34:47,434 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 34133
2024-08-09 11:34:47,434 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00002 Not Connected]
Waiting for SP....
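
The failure lines up with the stale endpoint: the clients and the admin console take their ports from overseer_agent.args.sp_end_point rather than from the participant entries, so they dial 8002/8003 while the server listens on 8004/8005. The helper below is illustrative only (not NVFlare code) and assumes the usual name:fl_port:admin_port layout of the endpoint string:

# Illustrative helper, not part of NVFlare: split an sp_end_point string into the
# pieces the participants actually dial (name, fed_learn port, admin port).
def split_sp_end_point(sp_end_point: str):
    name, fl_port, admin_port = sp_end_point.split(":")
    return name, int(fl_port), int(admin_port)

# Stale value that poc prepare copied into fed_client.json / fed_admin.json:
host, fl_port, admin_port = split_sp_end_point("localhost:8002:8003")
print(host, fl_port, admin_port)  # localhost 8002 8003, the ports being dialed in the
                                  # log above, while the server listens on 8004/8005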

Expected behavior
I would like to be able to change the POC overseer ports so that multiple developers can run POCs on the same machine, each with non-conflicting ports defined in separate project yamls.
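
Until the provisioner picks up the configured sp_end_point, a possible stop-gap is to patch the generated startup files after nvflare poc prepare. This is only a rough sketch, assuming the default poc/example_project/prod_00 workspace and the file names shown above; it is not an official workflow:

import json
from pathlib import Path

workspace = Path("poc/example_project/prod_00")  # default POC output shown above
new_sp_end_point = "localhost:8004:8005"         # host:fl_port:admin_port matching the server participant

def patch(node):
    # Recursively replace the DummyOverseerAgent sp_end_point; return True if anything changed.
    changed = False
    if isinstance(node, dict):
        args = node.get("args")
        is_agent = str(node.get("path", "")).endswith("DummyOverseerAgent")
        if is_agent and isinstance(args, dict) and "sp_end_point" in args:
            args["sp_end_point"] = new_sp_end_point
            changed = True
        changed = any([patch(v) for v in node.values()]) or changed
    elif isinstance(node, list):
        changed = any([patch(v) for v in node]) or changed
    return changed

for cfg in workspace.glob("*/startup/fed_*.json"):
    data = json.loads(cfg.read_text())
    if patch(data):
        cfg.write_text(json.dumps(data, indent=2))
        print(f"patched {cfg}")

After patching, nvflare poc start should have all participants dialing the 8004/8005 ports.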

Screenshots
See files/logs pasted above.

Desktop (please complete the following information):

  • OS: macOS Sonoma, Ubuntu 20.04
  • Python Version: 3.10.13
  • NVFlare Version: 2.4.1, 2.5.0rc1+29.ga276dc03

Additional context
N/A

good catch @parkeraddison, I will fix this.