
SWARM Learning For Histopathology Image Analysis

Background

The objective of this repository is to reproduce the Swarm Learning experiments described in Saldanha et al., bioRxiv, 2021. This study demonstrates the feasibility of decentralized training of AI systems in computational pathology via Swarm Learning. The basic procedure was previously used for transcriptomics data in Warnat-Herresthal et al., Nature 2021. Like Warnat-Herresthal et al., we use HPE Swarm Learning as the core Swarm Learning package in our pipeline. In this repository, we describe the pipeline which integrates Swarm Learning with an end-to-end computational pathology workflow. The (single-center) pathology image analysis workflow was described by Ghaffari Laleh et al., bioRxiv, 2021, as well as in several previous papers, including Kather et al., Nature Medicine 2019.

Please cite our publication if you use this for your research:

Oliver Lester Saldanha, Philip Quirke, Nicholas P. West, Jacqueline A. James, Maurice B. Loughrey, Heike I. Grabsch, Manuel Salto-Tellez, Elizabeth Alwers, Didem Cifci, Narmin Ghaffari Laleh, Tobias Seibel, Richard Gray, Gordon G. A. Hutchins, Hermann Brenner, Tanwei Yuan, Titus J. Brinker, Jenny Chang-Claude, Firas Khader, Andreas Schuppert, Tom Luedde, Sebastian Foersch, Hannah Sophie Muti, Christian Trautwein, Michael Hoffmeister, Daniel Truhn and Jakob Nikolas Kather. Swarm learning for decentralized artificial intelligence in cancer histopathology. bioRxiv, 2021. Available at: https://doi.org/10.1101/2021.11.19.469139

More information about our research group is available at http://kather.ai

General outline

Here, we detail the workflow for three physically separated computer systems (referred to in this repository as System A, System B, and System C). The SL process begins with the enrolment of nodes (processes) with the swarm network. System A is used to initialize the license server by starting the license container and installing the swarm license downloaded from the HPE login website. System A also starts the SPIFFE SPIRE container. The first SN node to go online is referred to as the “sentinel” node and is the first to register itself with the SPIFFE SPIRE network. Once the SN node on System A is ready, the SN nodes on System B and System C are started. During training, each system trains on its local data until the merging criterion (the sync interval) is reached. The node that finishes its training batch first becomes the leader: it collects the learned weights from the other peers (once the minimum number of peers, two in our case, is reached), averages them, and sends the merged weights back. A minimal sketch of this merge step follows.
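
To make the merge step concrete, below is a minimal, illustrative sketch of the weight averaging in Python. This is not HPE's actual (closed-source) implementation; the function name and the use of PyTorch state dicts are assumptions for illustration only.

    # Illustrative sketch of the leader's merge step (not HPE's implementation):
    # average the model weights collected from all peers.
    import torch

    def merge_weights(peer_state_dicts):
        """FedAvg-style merge: element-wise mean over the peers' parameters."""
        merged = {}
        for name in peer_state_dicts[0]:
            merged[name] = torch.stack(
                [sd[name].float() for sd in peer_state_dicts]
            ).mean(dim=0)
        return merged

    # The leader then sends the merged weights back to all peers, e.g.:
    # model.load_state_dict(merge_weights([weights_a, weights_b, weights_c]))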

Installation & Requirements

In general, the following is required in order to reproduce this experiment:

  • All three systems must be running Linux natively. We recommend using newly created user accounts with Docker installed on all systems. Running Linux in a virtual machine requires additional workarounds which are not described here. Here, we used Ubuntu 20.04.
  • On each system, the user requires administrator privileges for some installation steps. We recommend granting sudo privileges to the current user like this:
    1. sudo usermod -a -G sudo <username> where <username> is the name of the current user. Be aware that this should be revoked after running the experiments, for security reasons.
  • The user on each system requires Docker, which can be installed like this (assuming Docker's official apt repository has already been configured):
    1. sudo apt-get update
    2. sudo apt-get install docker-ce docker-ce-cli containerd.io
    The base Docker image for the SL node is pulled from the HPE registry; a separate Docker image is then built on top of it, containing all the Python modules necessary for our workflow.
  • Each user must be part of the docker group. This can be achieved by running the following command from a user account with admin privileges on each system:
    1. sudo usermod -a -G docker <username>
  • (optional) We recommend that each system has a CUDA-enabled GPU for faster training. Here, we propose a two-step approach with offline feature extraction and subsequent training of the swarm network on these features, which speeds up training. This also allows training on computers without a GPU in a reasonable amount of time.
  • To use the HPE Swarm Learning Community Edition, the initial step is to register an HPE license to run the SL platform. This process is managed by the License Server node. The SL platform itself runs as a set of Docker containers.

Example Data Set

A small example dataset is provided along with this repository. We extracted four subsets from the TCGA colorectal cancer (CRC) cohort. The subsets are taken from four contributing sites in TCGA: CM (Memorial Sloan Kettering Cancer Center), D5 (Greater Poland Cancer Center), G4 (Roswell Park) and A6 (Christiana Healthcare). Provided three systems are used as suggested above, each system can be allocated a unique subcohort, with the remaining subcohort used as an external test dataset. Because these datasets are much smaller than the ones in our study, performance can vary markedly between runs (unlike with large cohorts, where repeated runs usually give very similar results).

These datasets have been preprocessed according to “The Aachen Protocol for Deep Learning Histopathology: A hands-on guide for data preprocessing”. The whole slide images (WSIs) were tessellated (without any annotations) and normalized, and feature vectors were extracted for all slides using an off-the-shelf ResNet model. The original WSIs are available at the GDC Data Portal (COAD and READ, together referred to as CRC). In this example, the features are already extracted and are saved in the folders "System A", "System B" and "System C" in this repository.

When using your own data, you can tessellate the WSIs following the above-mentioned protocol and extract features using the scripts in this repository; a minimal sketch of the feature-extraction step is shown below.
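
The following sketch illustrates the feature-extraction step. It is a simplified stand-in for the repository's scripts: the tile folder layout, the JPEG file extension, the choice of ResNet-18, and the output file name are assumptions, and the actual preprocessing constants may differ.

    # Hedged sketch of offline feature extraction from image tiles with an
    # off-the-shelf ResNet; folder layout and output format are illustrative.
    from pathlib import Path

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet18(pretrained=True)
    model.fc = torch.nn.Identity()  # drop the classifier, keep 512-d features
    model.eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def extract_features(tile_dir):
        """Return one feature vector per tile image found in tile_dir."""
        feats = []
        for tile_path in sorted(Path(tile_dir).glob("*.jpg")):
            x = preprocess(Image.open(tile_path).convert("RGB")).unsqueeze(0)
            feats.append(model(x).squeeze(0))
        return torch.stack(feats)  # shape: (n_tiles, 512)

    # Example: save the features of one slide for later swarm training.
    # torch.save(extract_features("tiles/slide_001"), "features/slide_001.pt")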

System Preparation

Note: unless otherwise stated, the following must be done for each of the three systems.

  1. Clone this GitHub repository to each system

  2. Unzip the dataset into the folder SWARM/System A/data on all systems

  3. Change Hyperparameters:

    1. On System A, get the IP address (open a terminal and run the command: hostname -I | cut -f1 -d ' ').
    2. On System B and System C, open sl-node.sh and sn-node.sh with an editor and insert the previously noted IP address of System A in the predefined line (e.g., system_A_ip=137.116.23.146).
    3. (Optional) The target label (prediction target) can be changed inside the experiment file. The user has to provide the same target name on all three systems. In our case, we train on microsatellite instability (MSI) status; the target is called "isMSIH" and has two levels: "MSIH" and "nonMSIH" (a small illustration of this label encoding follows this list).
  4. Set up Docker on all systems:

    1. Log in to the HPE Docker registry; in a terminal, type: docker login hub.myenterpriselicense.hpe.com -u <HPE-PASSPORT-EMAIL> -p hpe_eval
    2. Enable Docker content trust: export DOCKER_CONTENT_TRUST=1
    3. Create a docker image with the name pyt-cv2 using the Dockerfile on all systems:
      • open a terminal in the docker folder
      • docker build -t pyt-cv2 .
  5. Connect the systems via passwordless SSH:

    1. (Optional) Set up passwordless SSH (note: if this is not done, disruptive password prompts will be required at multiple stages of the experiment). This has to be done on Systems B and C:
      • open a terminal and run ssh-keygen (accepting the defaults)
      • run cat ~/.ssh/id_rsa.pub and copy the printed public key
      • run ssh <linux username System A>@<IP of System A> to log in to System A
      • on System A, run mkdir -p ~/.ssh in case the folder does not exist yet
      • run cat >> ~/.ssh/authorized_keys, paste the copied public key, press Enter, and close the input with Ctrl+D
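
As a concrete illustration of the target-label setup from step 3 above, the sketch below encodes the "isMSIH" target with its two levels. The clinical-table layout, patient IDs, and column names are assumptions for illustration; the actual tables in this repository may differ.

    # Hypothetical illustration of the "isMSIH" target with the levels
    # "MSIH" / "nonMSIH"; patient IDs and column names are assumptions.
    import pandas as pd

    clini = pd.DataFrame({
        "PATIENT": ["TCGA-CM-0001", "TCGA-D5-0002", "TCGA-G4-0003"],
        "isMSIH":  ["MSIH", "nonMSIH", "nonMSIH"],
    })
    clini["label"] = (clini["isMSIH"] == "MSIH").astype(int)  # MSIH -> 1, nonMSIH -> 0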

Run Experiment

  1. (Only on System A) Run the swarm learning setup
    1. Open a terminal in "SWARM/System A/swarm-learning/bin"
    2. bash run-apls
      An example of the expected output: (see screenshot)
    3. Upload the HPE license key:
      1. Open the following website in your browser: https://<ip>:5814/autopass/login_input, substituting <ip> with System A's IP address (e.g., https://137.226.23.146:5814/autopass/login_input)
      2. Use the default credentials (user name: admin, password: password) and change the password when prompted
      3. Perform the steps shown in the screenshot
      4. A message should appear in the browser that the license key has been uploaded successfully.
      5. Do not close the terminal or the browser window.
  2. (Only on System A) In a new terminal, start the SPIRE server by running the spire-server.sh file in "SWARM/System A/":
    1. Go to System A/
    2. sh spire-server.sh
    3. Wait until the last lines of the output appear as follows: (see screenshot)
  3. (Only on System A) In a new terminal, run the SN node:
    1. Go to System A/
    2. sh sn-node-sentinal.sh
    3. Wait until the output reports the port, similar to the following: (see screenshot)
  4. Run the sn-node.sh file in the other two systems:
    1. Go to System #/ (System B/ on System B, System C/ on System C)
    2. sh sn-node.sh
    3. Wait until the output looks similar to the screenshot above.
  5. Run sl-node in all three systems
    1. Go to System #/ (do this on all three systems)
    2. sh sl-node.sh
    3. This will initialize the training of the model. The expected output is as follows: (see screenshot)
  6. As soon as all systems are done, the training will finish. The final trained model will be saved in SWARM/System A/MODEL/saved_model/ as a .pkl file; a minimal loading sketch is shown below.
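
To reuse the final model (e.g., for external validation), it can be loaded again in Python. The sketch below assumes the .pkl file is a pickled PyTorch model; the file name "model.pkl" is hypothetical, and the exact serialization used by the pipeline may differ.

    # Minimal loading sketch; "model.pkl" is a hypothetical file name and the
    # pickle may wrap the model differently in practice.
    import pickle

    with open("SWARM/System A/MODEL/saved_model/model.pkl", "rb") as f:
        model = pickle.load(f)

    model.eval()  # switch to inference mode (assuming a PyTorch module)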

Troubleshooting and mishaps

  • When starting a node, network communication issues might cause errors. Often, starting the node again or restarting all nodes will resolve the issue.
  • In our example, System A runs the SPIRE server. If this system shuts down or loses internet connection during training, the training process is stopped at this point.
  • If System B or C loses internet connectivity or drops out of training for other reasons, the training process will continue, and the dropped node or system can rejoin at a later time, before the training is completed.
  • If several peers drop out during training and fewer peers than the "minimum number of peers" remain, then the training stops and has to be restarted from the beginning. Therefore, for large networks, it is advisable to set a liberal (low) number in the minimum peers setting (see the sketch after this list).
  • Further information regarding the use of HPE Swarm Learning can be found in the documentation section of the following repository: HPE Swarm Learning. Issues regarding the HPE package should be posted there and are usually responded to by the HPE team.
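
For reference, the sync interval and the minimum number of peers mentioned above are set in the training script via HPE's SwarmCallback. The sketch below is illustrative only: the values are examples, the placeholder model is not the pipeline's real model, and argument names can vary between HPE Swarm Learning releases, so check the HPE documentation for your version.

    # Hedged sketch of where the merge criterion and minimum-peer count are
    # configured in HPE's Python wrapper; the values shown are examples.
    import torch
    from swarmlearning.pyt import SwarmCallback  # available inside the SL container

    model = torch.nn.Linear(512, 2)  # placeholder; the real model comes from the pipeline

    swarm_callback = SwarmCallback(
        syncFrequency=128,  # sync interval: merge after this many local batches
        minPeers=2,         # merging waits until at least this many peers joined
        model=model,        # the PyTorch model being trained
    )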

HIPAA compliance

The source code in this repository relies on Docker containers. These containers are not per se compliant with HIPAA, and some healthcare institutions in the United States have reservations about using Docker containers. Some of these issues can be remedied by using Singularity containers or other similar platforms. Future releases of our software will incorporate these options.

License

All data and source code in this repository are released under the MIT license:

Copyright 2021-2022. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.