PyTorch On Angel arms PyTorch with a powerful Parameter Server, which enables PyTorch to train very large models.

Pytorch on Angel

A lightweight project that runs PyTorch on Angel, giving PyTorch the ability to train high-dimensional models.

Architecture


Pytorch on Angel's architecture design consists of three modules:

  • python client: generates the PyTorch script module.
  • angel ps: provides a common Parameter Server (PS) service, responsible for distributed model storage, communication synchronization, and coordination of computing.
  • spark executor: the worker process, responsible for data processing, loading the PyTorch script module, and communicating with the Angel PS servers to complete model training and prediction. The PyTorch C++ backend runs in native mode and performs the actual computation.
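
The division of labor between the parameter server and the workers can be illustrated with a toy, single-process sketch. Everything in it is hypothetical: the `ToyParameterServer` class, its `pull`/`push` methods, and the squared-error linear model are invented for illustration and are not Angel's API. It only shows the pull, compute, push cycle that a worker repeats for the sparse slice of a high-dimensional model touched by its batch.

```python
from collections import defaultdict


class ToyParameterServer:
    """In-memory stand-in for a parameter server (not Angel's API)."""

    def __init__(self):
        # Parameters are stored sparsely: untouched indices default to 0.0.
        self.weights = defaultdict(float)

    def pull(self, indices):
        """Return only the weights a worker asks for."""
        return {i: self.weights[i] for i in indices}

    def push(self, grads, lr):
        """Apply a plain SGD update for the pushed gradients."""
        for i, g in grads.items():
            self.weights[i] -= lr * g


def worker_step(ps, batch, lr=0.1):
    """One pull -> compute -> push cycle for a linear model with squared error."""
    needed = {i for features, _ in batch for i in features}
    local = ps.pull(needed)
    grads = defaultdict(float)
    for features, label in batch:
        pred = sum(local[i] * v for i, v in features.items())
        err = pred - label
        for i, v in features.items():
            grads[i] += err * v / len(batch)
    ps.push(grads, lr)


ps = ToyParameterServer()
# Two sparse samples from a nominally huge feature space.
batch = [({3: 1.0, 1_000_000: 2.0}, 1.0), ({3: 1.0}, 0.0)]
for _ in range(200):
    worker_step(ps, batch)
print(len(ps.weights), "parameters materialized")
```

Only the two indices that appear in the batch are ever stored or transferred, which is the property that lets a parameter server hold models far larger than a single worker's memory.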

To use Pytorch on Angel, we need three components:

  • a jar file generated by the java subproject;
  • a .so file compiled by the cpp subproject, together with the set of shared libraries for the PyTorch C++ backend;
  • the pytorch algorithm script module generated by the python subproject.
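
Before submitting a job it can help to confirm that all three components exist. The sketch below is a hypothetical helper, not part of the project. It assumes the file names used by the Docker build described in this README (`pytorch-on-angel-<version>.jar`, `torch.zip`, which bundles the .so file with the shared libraries, and an example model named `deepfm.pt`), all placed in one directory.

```python
from pathlib import Path


def missing_components(dist_dir, model_name="deepfm.pt"):
    """Return the expected artifacts that are absent from dist_dir."""
    dist = Path(dist_dir)
    missing = []
    # The jar name carries a version, so match it with a glob.
    if not list(dist.glob("pytorch-on-angel-*.jar")):
        missing.append("pytorch-on-angel-<version>.jar")
    for name in ("torch.zip", model_name):
        if not (dist / name).is_file():
            missing.append(name)
    return missing


print(missing_components("./dist") or "all components present")
```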

Compilation & Deployment Instructions by Docker

Compile jar file and the shared c++ libraries package

# The script below builds the jar files and bundles the shared c++ libraries in containers
# The generated files 'pytorch-on-angel-<version>.jar' and 'torch.zip' are in ./dist
./build.sh

Generate a pytorch script model

# We have implemented some algorithms in the python directory under the root directory
# The script below generates a deepfm model, deepfm.pt, in ./dist
./gen_pt_model.sh python/recommendation/deepfm.py --input_dim 148 --n_fields 13 --embedding_dim 10 --fc_dims 10 5 1
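
The flags in the command above map naturally onto an argument parser. The sketch below is an assumption about how such a script might read them, not the actual code of `deepfm.py`; only the flag names and values are taken from the command, and the parser itself is hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Hypothetical DeepFM model generator CLI"
    )
    parser.add_argument("--input_dim", type=int, required=True)
    parser.add_argument("--n_fields", type=int, required=True)
    parser.add_argument("--embedding_dim", type=int, required=True)
    # nargs="+" collects every following integer into one list.
    parser.add_argument("--fc_dims", type=int, nargs="+", required=True)
    return parser


args = build_parser().parse_args(
    "--input_dim 148 --n_fields 13 --embedding_dim 10 --fc_dims 10 5 1".split()
)
print(args.fc_dims)  # [10, 5, 1]
```

Note that `--fc_dims` takes several values in a row, so any flag added after it must start with `--` to end the list.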

Compilation & Deployment Instructions Manually

If you don't have a docker environment, you can compile it manually, but you need to install all the dependencies on the machine. We strongly recommend using docker to compile.

Install Pytorch

PyTorch versions 1.2.0 through 1.5.0 are supported; version 1.5.0 is recommended.

  • pytorch =v1.5.0
  • python =3.7

We recommend using Anaconda to install PyTorch. Run the command:

conda install -c pytorch pytorch==1.5.0 torchvision==0.6.0 cpuonly

For detailed installation instructions, refer to the PyTorch installation documentation.
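
A quick way to guard against an unsupported installation is to compare the installed version against the supported range. The helper below is a hypothetical sketch; it assumes both bounds (1.2.0 and 1.5.0) are inclusive and ignores local build suffixes such as `+cpu`.

```python
import re

MIN_SUPPORTED = (1, 2, 0)
MAX_SUPPORTED = (1, 5, 0)


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '1.5.0' or '1.5.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(version):
    """True if the version lies within the supported range (bounds inclusive)."""
    return MIN_SUPPORTED <= parse_version(version) <= MAX_SUPPORTED


print(is_supported("1.5.0+cpu"))  # True
print(is_supported("1.6.0"))      # False
```

In an environment where PyTorch is installed, the same check can be applied to `torch.__version__`.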

Compiling java submodule

  1. Compiling Environment Dependencies

    • Jdk >= 1.8
    • Maven >= 3.0.5
  2. Source Code Download

    git clone https://github.com/Angel-ML/PyTorch-On-Angel.git
    
  3. Compile
    Run the following command in the java root directory of the source code:

    mvn clean package -Dmaven.test.skip=true
    

    After compiling, a jar package named 'pytorch-on-angel-<version>.jar' will be generated in target under the java root directory.

Compiling cpp submodule

  1. Compiling Environment Dependencies

    • gcc >= 5
    • cmake >= 3.12
  2. LibTorch Download

    • Download the libtorch package from here and extract it to the user-specified directory
    • Set TORCH_HOME (the path to libtorch) in CMakeLists.txt under the cpp root directory
  3. Compile
    Run the following command in the cmake-build-debug directory under the cpp root directory:

    cmake ..
    make
    

    After compiling, a shared library named 'libtorch_angel.so' will be generated in cmake-build-debug under the cpp root directory.
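
For cluster submission, the compiled library has to be bundled with libtorch's own shared libraries into one archive. The sketch below is a hypothetical packaging helper, not the project's build script. It assumes the archive should contain a single top-level `torch-lib` directory, which is inferred from the `./torch/torch-lib` library path and the `torch.zip#torch` archive alias used in the submit example in Quick Start.

```python
import zipfile
from pathlib import Path


def package_torch_libs(libtorch_lib_dir, angel_so, out_zip, top_dir="torch-lib"):
    """Zip libtorch's shared libraries plus libtorch_angel.so under top_dir/."""
    files = {p.name: p for p in Path(libtorch_lib_dir).iterdir() if p.is_file()}
    # The freshly built library replaces any stale copy already in the lib dir.
    files[Path(angel_so).name] = Path(angel_so)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):
            archive.write(files[name], f"{top_dir}/{name}")
    return sorted(files)


# Example call (paths are placeholders):
# package_torch_libs("libtorch/lib", "cpp/cmake-build-debug/libtorch_angel.so", "torch.zip")
```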

Quick Start

Spark on Angel deployment

PyTorch on Angel runs on Angel, so you must deploy the Angel client first. For the specific deployment process, refer to the documentation.
Note: Angel 3.2.0 is the recommended version for running PyTorch on Angel.

Submit to Cluster

Use $SPARK_HOME/bin/spark-submit to submit the application to the cluster from the PyTorch on Angel client.
Here is a submit example for DeepFM.

  1. Generate pytorch script model
    Follow Compilation & Deployment Instructions by Docker to generate the PyTorch model file, or generate it manually, for example:

    python deepfm.py --input_dim 148 --n_fields 13 --embedding_dim 10 --fc_dims 10 5 1
    
  2. Package c++ library files

    Put the compiled libtorch_angel.so into the lib directory of libtorch and then package it. Follow Compilation & Deployment Instructions by Docker or Manually to get the c++ library package, for example one named torch.zip.

  3. Upload training data to HDFS
    Upload the training data python/recommendation/census_148d_train.libsvm.tmp to an HDFS directory.

  4. Submit to Cluster

    source ./spark-on-angel-env.sh  
    $SPARK_HOME/bin/spark-submit \
           --master yarn-cluster \
           --conf spark.ps.instances=2 \
           --conf spark.ps.cores=1 \
           --conf spark.ps.jars=$SONA_ANGEL_JARS \
           --conf spark.ps.memory=3g \
           --conf spark.ps.log.level=INFO \
           --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/torch-lib \
           --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/torch-lib \
           --conf spark.executor.extraLibraryPath=./torch/torch-lib \
           --conf spark.driver.extraLibraryPath=./torch/torch-lib \
           --conf spark.executorEnv.OMP_NUM_THREADS=2 \
           --conf spark.executorEnv.MKL_NUM_THREADS=2 \
           --queue $queue \
           --name "deepfm for torch on angel" \
           --jars $SONA_SPARK_JARS  \
           --archives torch.zip#torch \
           --files deepfm.pt \
           --driver-memory 1g \
           --num-executors 2 \
           --executor-cores 1 \
           --executor-memory 3g \
           --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
           ./pytorch-on-angel-0.3.0.jar \
           trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
           stepSize:0.001 numEpoch:10 testRatio:0.1 \
           angelModelOutputPath:$output
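
The application arguments that follow the jar are plain `key:value` tokens. The sketch below shows one way such tokens can be split. It assumes that only the first colon separates the key from the value, so that values such as `hdfs://` paths keep their own colons. It mirrors the argument format shown above and is not the project's actual parser; the function name and the example path are hypothetical.

```python
def parse_app_args(tokens):
    """Split 'key:value' tokens on the first colon only."""
    params = {}
    for token in tokens:
        key, sep, value = token.partition(":")
        if not sep or not key:
            raise ValueError(f"expected key:value, got {token!r}")
        params[key] = value
    return params


params = parse_app_args([
    "trainInput:hdfs://namenode/data/train",  # hypothetical path
    "batchSize:128",
    "stepSize:0.001",
])
print(params["trainInput"])  # hdfs://namenode/data/train
```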
    

Algorithms

Currently, PyTorch on Angel supports a range of recommendation algorithms and deep graph convolutional network algorithms.

  1. Recommendation Algorithms
  2. Graph Algorithms