NVIDIA/NVFlare

[BUG] torch examples fail with "Received command to abort job"

Closed this issue · 9 comments

Describe the bug

When I submit jobs from the examples folder that use pytorch (for example hello-pt/jobs/hello-pt), the job seems to submit fine but then quickly fails with 'FINISHED:EXECUTION_EXCEPTION':

DEBUG: receive_and_process ...
DEBUG: Server Reply: {'time': '2024-05-28 05:44:08.725447', 'data': [{'type': 'dict', 'data': {'name': 'hello-pt', 'resource_spec': {}, 'min_clients': 2, 'deploy_map': {'app': ['@ALL']}, 'submitter_name': 'dxxxxxx@xxxxxx.com', 'submitter_org': 'Test', 'submitter_role': 'lead', 'job_folder_name': 'hello-pt', 'job_id': 'f0a29fe6-e299-4a7a-a1f1-68de498ed08e', 'submit_time': 1716875042.4441018, 'submit_time_iso': '2024-05-28T05:44:02.444102+00:00', 'start_time': '2024-05-28 05:44:03.868655', 'duration': '0:00:04.783606', 'status': 'FINISHED:EXECUTION_EXCEPTION', 'job_deploy_detail': ['server: OK', 'HPC-A40: OK', 'HPC-x080ti: OK'], 'schedule_count': 1, 'last_schedule_time': 1716875043.3628101, 'schedule_history': ['2024-05-28 05:44:03: scheduled']}}], 'meta': {'status': 'ok', 'info': '', 'job_meta': {'name': 'hello-pt', 'resource_spec': {}, 'min_clients': 2, 'deploy_map': {'app': ['@ALL']}, 'submitter_name': 'dxxxxxx@xxxxxx.com', 'submitter_org': 'Test', 'submitter_role': 'lead', 'job_folder_name': 'hello-pt', 'job_id': 'f0a29fe6-e299-4a7a-a1f1-68de498ed08e', 'submit_time': 1716875042.4441018, 'submit_time_iso': '2024-05-28T05:44:02.444102+00:00', 'start_time': '2024-05-28 05:44:03.868655', 'duration': '0:00:04.783606', 'status': 'FINISHED:EXECUTION_EXCEPTION', 'job_deploy_detail': ['server: OK', 'HPC-A40: OK', 'HPC-x080ti: OK'], 'schedule_count': 1, 'last_schedule_time': 1716875043.3628101, 'schedule_history': ['2024-05-28 05:44:03: scheduled']}}}
DEBUG: reply received!
DEBUG: server_execute Done [72429 usecs] 2024-05-27 22:44:08.723304

The log.txt in the job folder on the nvflare client says "Received command to abort job":

2024-05-28 03:06:20,493 - FederatedClient - INFO - Wait for client_runner to be created.
2024-05-28 03:06:20,494 - FederatedClient - INFO - Got client_runner after 0.0008237361907958984 seconds
2024-05-28 03:06:20,494 - FederatedClient - INFO - Got the new primary SP: grpc://myserver.mydomain.edu:8002
2024-05-28 03:06:20,495 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'
2024-05-28 03:06:20,503 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - created secure channel at myserver.mydomain.edu:8002
2024-05-28 03:06:20,504 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 N/A => myserver.mydomain.edu:8002] is created: PID: 12185
2024-05-28 03:06:21,006 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.85cd828b-4156-488b-8a0d-fa5a8c935b2f'], timeout=2.0
2024-05-28 03:06:22,572 - ClientRunner - INFO - [identity=AWS-T4, run=85cd828b-4156-488b-8a0d-fa5a8c935b2f]: Received command to abort job
2024-05-28 03:06:23,508 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.85cd828b-4156-488b-8a0d-fa5a8c935b2f'], timeout=2.0
2024-05-28 03:06:26,011 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.85cd828b-4156-488b-8a0d-fa5a8c935b2f'], timeout=2.0
2024-05-28 03:06:26,704 - ClientRunner - INFO - [identity=AWS-T4, run=85cd828b-4156-488b-8a0d-fa5a8c935b2f]: Received command to abort job
2024-05-28 03:06:28,513 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.85cd828b-4156-488b-8a0d-fa5a8c935b2f'], timeout=2.0
2024-05-28 03:06:31,015 - Cell - INFO - broadcast: channel='aux_communication', topic='__sync_runner__', targets=['server.85cd828b-4156-488b-8a0d-fa5a8c935b2f'], timeout=2.0

To Reproduce
I am using a simple Python job submission script submit-job.py:

#!/usr/bin/env python3
import os
import sys

import nvflare.fuel.flare_api.flare_api as nvf

username = "admin@mydomain.edu"   # placeholder: your admin user name
authloc = "/path/to/startup_kit"  # placeholder: your startup kit location

sess = nvf.new_secure_session(
    username=username,
    startup_kit_location=authloc,
    debug=True,
    timeout=3600,
)
job_folder = os.path.join(os.getcwd(), sys.argv[1])
job_id = sess.submit_job(job_folder)

and run the script with the hello-pt/jobs/hello-pt example:

./submit-job.py hello-pt/jobs/hello-pt

Expected behavior

When I run

./submit-job.py hello-numpy-sag/jobs/hello-numpy-sag

it finishes successfully, and log.txt looks much better:

2024-05-27 22:50:30,011 - ClientRunner - INFO - [identity=HPC-x080ti, run=2275bcfb-ff30-481c-90cc-e5ed4a6e50f1]: started end-run events sequence
2024-05-27 22:50:30,012 - ClientRunner - INFO - [identity=HPC-x080ti, run=2275bcfb-ff30-481c-90cc-e5ed4a6e50f1]: ABOUT_TO_END_RUN fired
2024-05-27 22:50:30,012 - ClientRunner - INFO - [identity=HPC-x080ti, run=2275bcfb-ff30-481c-90cc-e5ed4a6e50f1]: Firing CHECK_END_RUN_READINESS ...
2024-05-27 22:50:30,013 - ClientRunner - INFO - [identity=HPC-x080ti, run=2275bcfb-ff30-481c-90cc-e5ed4a6e50f1]: END_RUN fired
2024-05-27 22:50:30,013 - ReliableMessage - INFO - ReliableMessage is shutdown
2024-05-27 22:50:30,028 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 69492
2024-05-27 22:50:30,035 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 Not Connected] is closed PID: 69492
2024-05-27 22:50:30,035 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00003 Not Connected]
2024-05-27 22:50:30,036 - FederatedClient - INFO - Shutting down client run: HPC-x080ti
2024-05-27 22:50:30,177 - ClientRunner - INFO - [identity=HPC-x080ti, run=2275bcfb-ff30-481c-90cc-e5ed4a6e50f1]: Client is stopping ...
2024-05-27 22:50:31,463 - ReliableMessage - INFO - shutdown reliable message monitor
2024-05-27 22:50:31,686 - MPM - INFO - MPM: Good Bye!      

Desktop (the system where jobs were submitted):

  • OS: Debian 11.9 , Kernel 5.15
  • Python Version: 3.8 and 3.10
  • NVFlare Version: 2.4.1

Additional context

I tried both Python 3.8 and Python 3.10. I also tried running the nvflare client on AWS (Ubuntu 22.04) and on HPC (RHEL7), and the results were always the same. I only tried nvflare 2.4.1.

I should add that in POC mode the hello-pt example runs just fine:

2024-05-28 10:37:33,659 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer=example_project, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, task_name=validate, task_id=7b75e073-9836-4afd-9c57-1429aa1c98ff]: start to send task result to server
2024-05-28 10:37:33,659 - FederatedClient - INFO - Starting to push execute result.
2024-05-28 10:37:33,661 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate, peer=site-2, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: got result from client site-2 for task: name=validate, id=7b75e073-9836-4afd-9c57-1429aa1c98ff
2024-05-28 10:37:33,661 - CrossSiteModelEval - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate, peer=site-2, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer_rc=OK, task_name=validate, task_id=7b75e073-9836-4afd-9c57-1429aa1c98ff]: Saved validation result from client 'site-2' on model 'site-2' in /tmp/nvflare/poc/example_project/prod_00/server/startup/../e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4/cross_site_val/result_shareables/site-2_site-2
2024-05-28 10:37:33,662 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate, peer=site-2, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer_rc=OK, task_name=validate, task_id=7b75e073-9836-4afd-9c57-1429aa1c98ff]: finished processing client result by cross_site_validate
2024-05-28 10:37:33,662 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-2   task_id:7b75e073-9836-4afd-9c57-1429aa1c98ff
2024-05-28 10:37:33,663 - Communicator - INFO -  SubmitUpdate size: 700B (700 Bytes). time: 0.004070 seconds
2024-05-28 10:37:33,663 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer=example_project, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, task_name=validate, task_id=7b75e073-9836-4afd-9c57-1429aa1c98ff]: task result sent to server
2024-05-28 10:37:33,667 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer=example_project, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, task_name=validate, task_id=2f75981c-d0d3-4e40-8717-67a3f114669c]: start to send task result to server
2024-05-28 10:37:33,667 - FederatedClient - INFO - Starting to push execute result.
2024-05-28 10:37:33,669 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate, peer=site-1, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: got result from client site-1 for task: name=validate, id=2f75981c-d0d3-4e40-8717-67a3f114669c
2024-05-28 10:37:33,669 - CrossSiteModelEval - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate, peer=site-1, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer_rc=OK, task_name=validate, task_id=2f75981c-d0d3-4e40-8717-67a3f114669c]: Saved validation result from client 'site-1' on model 'site-2' in /tmp/nvflare/poc/example_project/prod_00/server/startup/../e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4/cross_site_val/result_shareables/site-1_site-2
2024-05-28 10:37:33,669 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate, peer=site-1, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer_rc=OK, task_name=validate, task_id=2f75981c-d0d3-4e40-8717-67a3f114669c]: finished processing client result by cross_site_validate
2024-05-28 10:37:33,670 - SubmitUpdateCommand - INFO - submit_update process. client_name:site-1   task_id:2f75981c-d0d3-4e40-8717-67a3f114669c
2024-05-28 10:37:33,671 - Communicator - INFO -  SubmitUpdate size: 700B (700 Bytes). time: 0.003737 seconds
2024-05-28 10:37:33,671 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer=example_project, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, task_name=validate, task_id=2f75981c-d0d3-4e40-8717-67a3f114669c]: task result sent to server
2024-05-28 10:37:33,995 - CrossSiteModelEval - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate]: task validate exit with status TaskCompletionStatus.OK
2024-05-28 10:37:34,367 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate]: Workflow: cross_site_validate finalizing ...
2024-05-28 10:37:34,497 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate]: ABOUT_TO_END_RUN fired
2024-05-28 10:37:34,507 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer=example_project, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: received request from Server to end current RUN
2024-05-28 10:37:34,509 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, peer=example_project, peer_run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: received request from Server to end current RUN
2024-05-28 10:37:34,502 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate]: Firing CHECK_END_RUN_READINESS ...
2024-05-28 10:37:34,512 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate]: END_RUN fired
2024-05-28 10:37:34,513 - ReliableMessage - INFO - ReliableMessage is shutdown
2024-05-28 10:37:34,513 - ServerRunner - INFO - [identity=example_project, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4, wf=cross_site_validate]: Server runner finished.
2024-05-28 10:37:35,664 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: started end-run events sequence
2024-05-28 10:37:35,665 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: ABOUT_TO_END_RUN fired
2024-05-28 10:37:35,666 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: Firing CHECK_END_RUN_READINESS ...
2024-05-28 10:37:35,667 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: END_RUN fired
2024-05-28 10:37:35,668 - ReliableMessage - INFO - ReliableMessage is shutdown
2024-05-28 10:37:35,674 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: started end-run events sequence
2024-05-28 10:37:35,674 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: ABOUT_TO_END_RUN fired
2024-05-28 10:37:35,675 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: Firing CHECK_END_RUN_READINESS ...
2024-05-28 10:37:35,676 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: END_RUN fired
2024-05-28 10:37:35,676 - ReliableMessage - INFO - ReliableMessage is shutdown
2024-05-28 10:37:35,686 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 Not Connected] is closed PID: 183560
2024-05-28 10:37:35,686 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 184037
2024-05-28 10:37:35,688 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 Not Connected] is closed PID: 183555
2024-05-28 10:37:35,688 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 Not Connected] is closed PID: 184038
2024-05-28 10:37:35,784 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00005 Not Connected] is closed PID: 183550
2024-05-28 10:37:35,785 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00006 Not Connected] is closed PID: 183550
2024-05-28 10:37:35,796 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 Not Connected] is closed PID: 184038
2024-05-28 10:37:35,797 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00003 Not Connected]
2024-05-28 10:37:35,798 - FederatedClient - INFO - Shutting down client run: site-2
2024-05-28 10:37:35,798 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00003 Not Connected] is closed PID: 184037
2024-05-28 10:37:35,799 - nvflare.fuel.f3.drivers.grpc_driver.GrpcDriver - INFO - CLIENT: finished connection [CN00003 Not Connected]
2024-05-28 10:37:35,800 - FederatedClient - INFO - Shutting down client run: site-1
2024-05-28 10:37:35,904 - ClientRunner - INFO - [identity=site-1, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: Client is stopping ...
2024-05-28 10:37:35,911 - ClientRunner - INFO - [identity=site-2, run=e4a7a391-1f61-43d4-a5b0-ba4f96e6dac4]: Client is stopping ...
2024-05-28 10:37:35,978 - ReliableMessage - INFO - shutdown reliable message monitor
2024-05-28 10:37:35,979 - ReliableMessage - INFO - shutdown reliable message monitor
2024-05-28 10:37:36,396 - ReliableMessage - INFO - shutdown reliable message monitor
2024-05-28 10:37:37,412 - MPM - INFO - MPM: Good Bye!
2024-05-28 10:37:37,418 - MPM - INFO - MPM: Good Bye!
2024-05-28 10:37:37,423 - FederatedServer - INFO - Server app stopped.

@dirkpetersen thanks for the issue and detailed logs!

So the problem happens when you use hello-pt/jobs/hello-pt with "real-world deployment".

Can you check if you have installed all the requirements on all of your client sites?
Each client machine should do:

  1. pip install nvflare==2.4.1
  2. pip install -r requirements.txt (https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/hello-pt)
  3. Prepare the dataset (for the hello-pt example, we just assume everyone has the same dataset): bash ./prepare_data.sh

Yes, I can confirm that all clients have nvflare 2.4.1, the latest versions of torch and torchvision, and the CIFAR data installed. (BTW: the prepare_data step is actually not required for hello-pt, because it downloads and deploys the data automatically.)

I was also running into an SSL issue with some versions of Python 3.10 (but not Python 3.8) that required a workaround. I am continuing testing with 3.8 and a version of 3.10 that does NOT require this workaround: https://github.com/dirkpetersen/nvflare-cancer?tab=readme-ov-file#ssl-issue

@dirkpetersen thanks for the confirmation and you are right!

In this case, can you share the logs of your server and clients? They can be found in server_workspace/log.txt, server_workspace/[job_id]/log.txt, client_workspace/log.txt, and client_workspace/[job_id]/log.txt (refer to the documentation for details).

OK, here are the files. The server was installed on AWS with the startup/start.sh --cloud aws feature, and the workspace does not show any job logs (see ls -l below). Am I missing something there?

log-client-1.txt
log-client-2.txt
log-job-client-1.txt
log-job-client-2.txt
log-server.txt

ubuntu@ip-xxx-xx-x-xxx:~$ ls -l /var/tmp/cloud/
total 404
-r-------- 1 ubuntu ubuntu   1675 May 14 16:23 NVFlareServerKeyPair.pem
-rw-rw-r-- 1 ubuntu ubuntu  73904 May 28 21:17 audit.log
-rw-rw-r-- 1 ubuntu ubuntu      4 May 26 14:27 daemon_pid.fl
drwxr-xr-x 2 ubuntu ubuntu   4096 May 14 16:23 local
-rw-rw-r-- 1 ubuntu ubuntu 239096 May 28 21:19 log.txt
-rw-rw-r-- 1 ubuntu ubuntu      4 May 26 14:28 pid.fl
-rw-r--r-- 1 ubuntu ubuntu    588 May 14 16:23 readme.txt
drwxr-xr-x 2 ubuntu ubuntu   4096 May 14 16:24 startup
drwxrwxr-x 2 ubuntu ubuntu   4096 May 14 16:24 transfer
-rw-r--r-- 1 ubuntu ubuntu   4473 May 14 16:23 vm_create.json
-rw-r--r-- 1 ubuntu ubuntu   7018 May 14 16:23 vm_result.json

@dirkpetersen thanks for sharing. I see in your server log:
ServerEngine - INFO - Job: 8f715bcd-e371-49fb-bfdb-1717542d2f3d child process exit with return code 103

That means the server job process died somehow.
Without the log I can't see what happened.

Can you use the admin CLI to log in and run the command download_job 8f715bcd-e371-49fb-bfdb-1717542d2f3d to get the whole workspace folder? (It will be downloaded to the "transfer" folder of the admin client.)

@IsaacYangSLA since the user is using the AWS script, do you have any insight on this?

Sorry for the delay ... oh, I am seeing it now: pytorch was not installed on the server, and so the server process bombed out. But where does this server log that I get with download_job come from? I did not find it under /var/tmp/cloud/ on the server. I also did not realize that pytorch was a requirement on the server. Would the server even benefit from having a GPU? I would perhaps recommend installing the tools required to run the examples by default, rather than relying on the user to put a requirements.txt into startup.

The hello-pt example works now, thank you !!!

@dirkpetersen glad it works for you!

but where does this server log that i get with download_job come from ?..... I did not find it under /var/tmp/cloud/ on the server

So on the FL server side, the server parent process (SP) spawns a new server job process (SJ) when a job is deployed and started.
The reason we separate them is that the job process may contain custom user code and can crash easily for various reasons (code bugs, OOM, etc.). We don't want that to interrupt the SP.

To answer your question: that log is from the SJ, while the log you shared from /var/tmp/cloud is from the SP.

You can find more details here: https://nvflare.readthedocs.io/en/stable/real_world_fl/workspace.html
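The isolation this buys can be sketched with a stdlib-only toy (this is not NVFlare's actual code; 103 is just the return code seen in the server log above):

```python
import multiprocessing
import sys


def server_job():
    # Stands in for an SJ that hits a fatal error (e.g. pytorch missing
    # on the server) and dies with a non-zero return code.
    sys.exit(103)


def run_job_from_parent():
    # The SP only launches the job in a separate process and collects
    # its exit code; a crash inside the SJ cannot take the SP down.
    sj = multiprocessing.Process(target=server_job)
    sj.start()
    sj.join()
    return sj.exitcode


if __name__ == "__main__":
    print(f"child process exit with return code {run_job_from_parent()}")
```

This is why the SP log only records "child process exit with return code 103" while the actual traceback lives in the SJ's own log.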

I also did not realize that pytorch was a requirement on the server. I would perhaps recommend to install tools by default that are required to run the examples and not rely on the user to put a requirements.txt into startup.

So the requirements vary from job to job.
For the hello-pt example, we use the PTModelPersistor on the server side (as configured in config_fed_server.json), so pytorch needs to be installed on the server.
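For reference, the hello-pt server config declares that persistor roughly like this (a sketch; exact component paths and args may differ between NVFlare versions):

```json
{
  "components": [
    {
      "id": "persistor",
      "path": "nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor",
      "args": {
        "model": {
          "path": "simple_network.SimpleNetwork"
        }
      }
    }
  ]
}
```

Instantiating a component like this is what imports torch in the server job process.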

For other jobs, you can use a different model persistor on the server side that requires NO extra dependencies, so this is really up to users to decide for their applications.

would the server even benefit from having a GPU ?

It could, depending on the components you configure/add in the server job configuration.

I would perhaps recommend to install tools by default that are required to run the examples and not rely on the user to put a requirements.txt into startup.

We will see how to make requirement management easier for users.
As you see, each job is different.
One way users can install pytorch is: pip install nvflare[PT]