Swarm Learning is a decentralized, privacy-preserving Machine Learning framework. This framework utilizes the computing power at, or near, the distributed data sources to run the Machine Learning algorithms that train the models. It uses the security of a blockchain platform to share learnings with peers in a safe and secure manner. In Swarm Learning, training of the model occurs at the edge, where data is most recent, and where prompt, data-driven decisions are most necessary. In this completely decentralized architecture, only the insights learned are shared with the collaborating ML peers, not the raw data. Because the raw data never leaves its source, data security and privacy are strengthened.
The Swarm Learning framework is made up of components known as nodes: Swarm Learning (SL) nodes, Swarm Network (SN) nodes, Swarm Learning Command Interface (SWCI) nodes, and Swarm Operator (SWOP) nodes. Each Swarm Learning node is modularized and runs in a separate container. The nodes represent different Swarm Learning functionality, not physical server nodes.
- SL nodes run the core of Swarm Learning. An SL node works in collaboration with all the other SL nodes in the network. It regularly shares its learnings with the other nodes and incorporates their insights. SL nodes act as an interface between the user model application and other Swarm Learning components. SL nodes take care of distributing and merging model weights securely.
- SN nodes form the blockchain network. The current version of Swarm Learning uses an open-source version of Ethereum as the underlying blockchain platform. The SN nodes interact with each other using this blockchain platform to maintain and track the state and progress of training. The SN nodes use this state and progress information to coordinate the working of the other Swarm Learning components.
The Sentinel node is a special SN node. It is responsible for initializing the blockchain network and is the first node to start.
NOTE: Only metadata is written to the blockchain. The model itself is not stored in the blockchain.
- The SWCI node is the command interface tool for the Swarm Learning framework. It is used to monitor and manage the framework. SWCI nodes can connect to any of the SN nodes in a given Swarm Learning framework. For more information on SWCI, see Swarm Learning Command Interface.
- SWOP is an agent that can manage Swarm Learning operations. SWOP is responsible for executing the tasks that are assigned to it. A SWOP node can execute only one task at a time. SWOP helps in executing tasks such as starting and stopping Swarm runs, building and upgrading ML containers, and sharing models for training. For more information about SWOP, see Swarm Operator node (SWOP).
- Swarm Learning security and digital identity aspects are handled by X.509 certificates. Communication among Swarm Learning components is secured using X.509 certificates. Users can either generate their own certificates or directly use certificates generated by standard security software such as SPIRE. For more information on SPIRE, see https://thebottomturtle.io/Solving-the-bottom-turtle-SPIFFE-SPIRE-Book.pdf and https://spiffe.io/.
NOTE: The Swarm Learning framework does not initialize if certificates are not provided.
- Swarm Learning components communicate with each other using a set of TCP/IP ports.
NOTE: The participating nodes must be able to access each other's ports.
For more information on port details that must be opened, see Exposed Ports.
- The License Server installs and manages the license that is required to run the Swarm Learning framework. The licenses are managed by the AutoPass License Server (APLS), which runs on a separate node. For more information, see APLS User Guide.
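The requirement above that participating nodes can reach each other's ports can be checked before starting a run. The following is a minimal sketch using only the Python standard library; the function name `check_ports` is hypothetical and not part of Swarm Learning, and the hosts and port numbers passed to it are placeholders (see Exposed Ports for the ports your deployment actually uses).

```python
import socket
from typing import Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def check_ports(endpoints: Iterable[Endpoint], timeout: float = 2.0) -> Dict[Endpoint, bool]:
    """Try a TCP connection to each (host, port) and report whether it succeeded.

    This helper is an illustration only; it is not shipped with Swarm Learning.
    """
    results: Dict[Endpoint, bool] = {}
    for host, port in endpoints:
        try:
            # create_connection performs a full TCP handshake, so success means
            # the port is open and reachable from this machine.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution errors.
            results[(host, port)] = False
    return results
```

Run it from each host against the other hosts; any endpoint reported as `False` points to a firewall rule or container port mapping that needs attention.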
Each Swarm Learning node works in collaboration with the other Swarm Learning nodes in the network. It regularly shares its learnings with the other nodes and incorporates their insights. This process continues until the Swarm Learning nodes train the model to the desired state.
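To make the idea of "sharing learnings and incorporating insights" concrete, here is a minimal sketch of one possible merge step. This README does not specify the merge algorithm, so the sketch assumes a simple (optionally weighted) element-wise mean of the peers' weights; `merge_weights` and the layer name `dense` are hypothetical and are not Swarm Learning APIs.

```python
from typing import Dict, List, Optional

# A model is represented here as a mapping from layer name to a flat weight list.
Weights = Dict[str, List[float]]


def merge_weights(peer_weights: List[Weights],
                  peer_sizes: Optional[List[int]] = None) -> Weights:
    """Return the element-wise weighted mean of the peers' weights.

    Assumption: merging is a weighted average. The real framework may differ.
    """
    if not peer_weights:
        raise ValueError("at least one peer is required")
    if peer_sizes is None:
        peer_sizes = [1] * len(peer_weights)  # equal contribution per peer
    if len(peer_sizes) != len(peer_weights):
        raise ValueError("peer_sizes must match peer_weights in length")

    total = float(sum(peer_sizes))
    merged: Weights = {}
    for name, reference in peer_weights[0].items():
        merged[name] = [
            sum(w[name][i] * s for w, s in zip(peer_weights, peer_sizes)) / total
            for i in range(len(reference))
        ]
    return merged


# Two peers contribute equally, so each merged value is the plain average.
merged = merge_weights([{"dense": [1.0, 2.0]}, {"dense": [3.0, 4.0]}])
print(merged)  # {'dense': [2.0, 3.0]}
```

Passing `peer_sizes` (for example, the number of local training samples per peer) lets peers with more data contribute proportionally more to the merged model.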
Users can transform any Keras- or PyTorch-based ML program written in Python 3 into a Swarm Learning ML program by making a few simple changes to the model training code to include the SwarmCallback API. For sample code, see any of the examples included with the Swarm Learning package.
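The general callback pattern can be sketched without the Swarm Learning package installed. The class below is a hypothetical stand-in, not the real SwarmCallback: its name, its `sync_every` parameter, and its `on_sync` hook are assumptions made for illustration. It only shows how a callback attached to a training loop can trigger a synchronization step every few batches; refer to the packaged examples for the actual API.

```python
from typing import Callable, List


class SyncCallbackSketch:
    """Hypothetical stand-in that illustrates the callback pattern.

    NOT the real SwarmCallback API; names and parameters are assumptions.
    """

    def __init__(self, sync_every: int, on_sync: Callable[[int], None]) -> None:
        if sync_every < 1:
            raise ValueError("sync_every must be at least 1")
        self.sync_every = sync_every
        self.on_sync = on_sync
        self.batches_seen = 0

    def on_batch_end(self) -> None:
        # Count local batches and trigger a sync at every sync_every-th batch.
        self.batches_seen += 1
        if self.batches_seen % self.sync_every == 0:
            self.on_sync(self.batches_seen)


def train(num_batches: int, callback: SyncCallbackSketch) -> None:
    """Placeholder training loop: the local training step itself is omitted."""
    for _ in range(num_batches):
        callback.on_batch_end()


sync_points: List[int] = []
train(10, SyncCallbackSketch(sync_every=4, on_sync=sync_points.append))
print(sync_points)  # [4, 8]
```

The point of the pattern is that the existing training loop stays unchanged; only a callback object is added to it.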
The transformed user Machine Learning program (the user ML node) can be run directly on the host, or users can build it as a Docker container.
NOTE: HPE recommends that users build an ML Docker container.
The ML node is responsible for training and iteratively updating the model. For each ML node, there is a corresponding SL node in the Swarm Learning framework, which performs the Swarm training. Each pair of ML and SL nodes must run on the same host. Training continues until the SL nodes train the model to the desired state.
NOTE: All the ML nodes must use the same ML platform, either Keras (based on the TensorFlow 2 backend) or PyTorch. Using Keras for some of the nodes and PyTorch for the others is not supported.
- Prerequisites for Swarm Learning
- Clone this repository on all machines where you want to run Swarm Learning.
NOTE: The suggested default location is to clone it under /opt/hpe. It will create a swarm-learning folder and copy the files under it. If you clone it in a different location, please make sure to give the same location when running the installer UI, in the Swarm Installation location text box, as "<clone-location>/swarm-learning". For the default case, the UI screen would have /opt/hpe/swarm-learning pre-populated.
CAUTION: Do not save model-related artifacts under this folder, because future version upgrades of Swarm Learning would delete these folders.
- Upgrading from earlier evaluation versions
- Download and setup Swarm Learning using the Web UI installer
- Execute MNIST example
- Frequently Asked Questions
- Troubleshooting
- How Swarm Learning Components interact
- Component interactions when using Reverse Proxy
- Swarm Learning Concepts
- Working of a Swarm Learning node
- Adapting ML programs for Swarm Learning
- Swarm wheels package
- Configuring Swarm Learning components
- Running Swarm Learning Components
- Using SWCI
- Using SWOP
- Examples
- Swarm Learning Log Collection
Refer to Acronyms and Abbreviations for more information.
Feedback and questions are appreciated. You can use the issue tracker to report bugs on GitHub, or join the HPE Developer Slack Workspace and start a discussion in our #hpe-swarm-learning channel.
Refer to Contributing for more information.
The distribution of Swarm Learning in this repository is for non-commercial and experimental use under this license.
See ATTRIBUTIONS and DATA LICENSE for terms and conditions for using the datasets included in this repository.