Competition-Oriented Jeddak Platform

Overview

Jeddak is a platform for privacy computing and federated learning, oriented toward both academia and industry.

This is a competition-oriented lite version of Jeddak. Guides for deployment, development, and usage are provided below.

Deploy Guide

Jeddak provides two deployment modes: standalone and cluster. Standalone mode is intended for fast experimental verification of new algorithms on a single host, while cluster mode supports production use in real multi-host applications. Note that the competition is conducted in cluster mode.

Refer to doc/guide/quickstart.md for the deployment guide.

Develop Guide

Jeddak provides standardized interfaces for developing your own federated learning and privacy-preserving algorithms.

Refer to doc/guide/develop_guide.md for more details.

Use Guide

Algorithm List

Jeddak provides a series of ready-made privacy-preserving algorithms, as described in the following table. This lite version includes only a limited subset of these algorithms, mainly for demonstration purposes. Their configurations can be found under example/conf/.

| Algorithm Name | Classification | Description |
| --- | --- | --- |
| data_loader | Preprocessing | Read data from various data sources |
| data_saver | Postprocessing | Save data to disk in various data structures |
| aligner | Preprocessing | Seek the intersection of the private sets held by multiple parties in a privacy-preserving fashion |
| glm | Federated Learning | A set of generalized linear models, including linear regression, logistic regression and Poisson regression |
| dpgbdt | Federated Learning | Differentially Private Gradient Boosting Decision Tree |
| neural_network | Federated Learning | Deep Neural Network |
| evaluate | Postprocessing | Evaluate a federated learning model |
| model_loader | Postprocessing | Load a model from a local file / unload a model from memory |
| predict_offline | Postprocessing | Offline prediction with a specified model |
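
To illustrate how these building blocks relate, the sketch below lists one plausible ordering of tasks for a federated learning job, following the Preprocessing / Federated Learning / Postprocessing classification above. This is an illustration only; the actual task composition is defined by the configuration files under example/conf/.

```python
# Hypothetical task ordering for a federated learning job, grouped by the
# classifications in the table above. The real pipeline is defined in example/conf/.
pipeline = [
    "data_loader",      # Preprocessing: read raw data from CSV/HDFS
    "aligner",          # Preprocessing: privacy-preserving set intersection across parties
    "glm",              # Federated Learning: train a generalized linear model
    "evaluate",         # Postprocessing: evaluate the trained model
    "predict_offline",  # Postprocessing: offline prediction with the trained model
    "data_saver",       # Postprocessing: persist results to disk
]
```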

Parameter List

data_loader parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "data_loader" | "data_loader" | task type |
| task_role | str | {"guest", "host", "sole", "slack"} | "guest" | task role. "guest"/"host" is the party's role in the task; "sole" means only this party carries out the task; "slack" means the party does nothing in the task |
| input_data_source | str | {"csv", "hdfs"} | "csv" | type of input data source: "csv" means local files, "hdfs" means a file path on Hadoop HDFS |
| input_data_path | str | any string | N/A | file path of the input data; must be valid and readable |
| train_data_path | str | any string | N/A | file path of the training data; must be valid and readable. Falls back to input_data_path if not set |
| validate_data_path | str | any string | N/A | file path of the validation data; must be valid and readable |
| convert_sparse_to_index | bool | {true, false} | true | convert sparse features to natural numbers if true |
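
For illustration, the snippet below assembles these parameters into a plausible data_loader task configuration. It is a minimal Python sketch: the parameter names come from the table above, while the file paths are placeholders and the concrete configuration layout follows the files under example/conf/.

```python
# Hypothetical data_loader configuration; file paths are placeholders.
data_loader_conf = {
    "task_type": "data_loader",
    "task_role": "guest",                     # this party joins the task as the guest
    "input_data_source": "csv",               # read local CSV files ("hdfs" for Hadoop HDFS paths)
    "input_data_path": "/path/to/input.csv",
    "train_data_path": "/path/to/train.csv",  # falls back to input_data_path if unset
    "validate_data_path": "/path/to/validate.csv",
    "convert_sparse_to_index": True,          # map sparse features to natural numbers
}
```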

data_saver parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "data_saver" | "data_saver" | task type |
| task_role | str | {"guest", "host", "sole", "slack"} | "guest" | task role |
| output_data_source | str | {"csv"} | "csv" | type of output data source |

aligner parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "aligner" | "aligner" | task type |
| task_role | str | {"guest", "host"} | "guest" | task role |
| align_mode | str | {"diffie_hellman", "cm20", "dh_PSI", "tee"} | "cm20" | PSI (private set intersection) protocol |
| output_id_only | bool | {true, false} | true | output only the id of each element in the intersection set |
| sync_intersection | bool | {true, false} | true | synchronize the intersection set among all parties |
| key_size | int | {1024, 2048, 3072, 4096} | 1024 | cryptographic key length (in bits) |
| batch_num | int | {"auto"}, [1, inf) | "auto" | batch number for PSI in "cm20" mode; an integer value is rounded up to a power of 2 |
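
A possible aligner configuration might look like the following sketch; values other than the documented defaults (e.g. the key size) are arbitrary choices, and the exact layout of the files under example/conf/ may differ.

```python
# Hypothetical aligner (PSI) configuration.
aligner_conf = {
    "task_type": "aligner",
    "task_role": "guest",
    "align_mode": "cm20",          # PSI protocol; alternatives: "diffie_hellman", "dh_PSI", "tee"
    "output_id_only": True,        # keep only the ids of intersected records
    "sync_intersection": True,     # share the intersection with all parties
    "key_size": 2048,              # cryptographic key length in bits
    "batch_num": "auto",           # or an integer, rounded up to a power of 2
}
```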

glm parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | {"linear_regression", "logistic_regression", "poisson_regression"} | N/A | task type |
| task_role | str | {"guest", "host", "server", "client"} | "guest" | task role |
| penalty | str | {"l1", "l2", null} | "l2" | penalty term |
| tol | float | [0, inf) | 1e-4 | tolerance for the stopping criterion |
| C | float | (0, inf) | 1.0 | inverse of regularization strength |
| fit_intercept | bool | {true, false} | true | whether to add a bias (intercept) term |
| intercept_scaling | float | [0, inf) | 1.0 | x becomes [x, intercept_scaling] if fit_intercept is true |
| solver | str | {"gradient_descent", "AdaGram", "AdaDelta", "RMSprop"} | "gradient_descent" | optimization method |
| max_iter | int | [1, inf) | 100 | maximum number of iteration rounds |
| learning_rate | float | (0, inf) | 0.15 | learning step size |
| homomorphism | str | {"cpaillier"} | "cpaillier" | homomorphic encryption method |
| key_size | int | [1, inf) | 1024 | homomorphic encryption key size |
| gamma | float | (0, inf) | 0.9 | adjusts the sum of past squared gradients |
| epsilon | float | (0, inf) | 1e-8 | smooths gradients and avoids division by zero |
| batch_fraction | float | (0, 1] | 0.1 | fraction of the training set used in each mini-batch |
| batch_type | str | {"batch", "mini-batch"} | "batch" | batch method |
| balanced_class_weight | bool | {true, false} | true | automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)); automatically disabled for continuous labels |
| train_validate_freq | int | [1, inf) | None | run validation on the validation data every train_validate_freq epochs if train_validate_freq is not None |
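
The sketch below shows how a logistic regression task could be parameterized with these options. It is illustrative only; values that differ from the documented defaults are arbitrary choices.

```python
# Hypothetical glm configuration for logistic regression.
glm_conf = {
    "task_type": "logistic_regression",
    "task_role": "guest",
    "penalty": "l2",
    "tol": 1e-4,
    "C": 1.0,                          # inverse regularization strength
    "fit_intercept": True,
    "solver": "gradient_descent",
    "max_iter": 100,
    "learning_rate": 0.15,
    "homomorphism": "cpaillier",       # homomorphic encryption method
    "key_size": 1024,
    "batch_type": "mini-batch",
    "batch_fraction": 0.1,             # 10% of the training set per mini-batch
    "balanced_class_weight": True,
    "train_validate_freq": 5,          # validate every 5 epochs
}
```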

dpgbdt parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "dpgbdt" | "dpgbdt" | task type |
| task_role | str | {"guest", "host"} | "guest" | task role |
| objective | str | {"reg_squarederror", "binary_logistic", "count_poisson"} | "binary_logistic" | learning objective |
| num_round | int | [1, inf) | 20 | number of boosting rounds |
| eta | float | (0, inf) | 0.3 | learning rate |
| gamma | float | [0, inf) | 0.0 | minimum loss reduction required to make a further partition on a leaf node of the tree |
| max_depth | int | [1, inf) | 3 | maximum depth of a tree |
| min_child_weight | float | [0, inf) | 1.0 | minimum sum of instance weights (hessian) needed in a child |
| max_delta_step | float | [0, inf) | 0.0 | maximum delta step allowed for each leaf output |
| sub_sample | float | (0, 1] | 1.0 | subsample ratio of the training instances at each boosting iteration |
| lam | float | [0, inf) | 1.0 | L2 regularization strength |
| sketch_eps | float | (0, 1) | 0.03 | each column is converted into at most 1 / sketch_eps bins |
| homomorphism | str | {"cpaillier"} | "cpaillier" | homomorphic encryption method |
| key_size | int | [1, inf) | 1024 | homomorphic encryption key size |
| importance_type | str | {"weight", "gain", "cover", "total_gain", "total_cover", "all"} | "weight" | feature importance type |
| train_validate_freq | int | [1, inf) | None | run validation on the validation data every train_validate_freq trees if train_validate_freq is not None |
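
By way of example, a binary classification run of dpgbdt could be configured as in the sketch below, which simply uses the documented defaults; it is not taken from example/conf/.

```python
# Hypothetical dpgbdt configuration for binary classification (documented defaults).
dpgbdt_conf = {
    "task_type": "dpgbdt",
    "task_role": "guest",
    "objective": "binary_logistic",
    "num_round": 20,              # number of boosting rounds
    "eta": 0.3,                   # learning rate
    "gamma": 0.0,                 # minimum loss reduction for a further split
    "max_depth": 3,
    "min_child_weight": 1.0,
    "sub_sample": 1.0,
    "lam": 1.0,                   # L2 regularization strength
    "sketch_eps": 0.03,           # at most 1 / sketch_eps bins per feature
    "homomorphism": "cpaillier",
    "key_size": 1024,
    "importance_type": "weight",
}
```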

neural_network parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "neural_network" | "neural_network" | task type |
| task_role | str | {"guest", "host"} | "guest" | task role |
| backend | str | {"keras", "pytorch"} | None | backend deep learning framework |
| format | str | {"file", "conf"} | None | input format of the top/mid/bottom models to be loaded |
| btm | str | any | None | Keras model config JSON string or model file path for the bottom model |
| mid | str | any | None | Keras model config JSON string or model file path for the mid model |
| top | str | any | None | Keras model config JSON string or model file path for the top model |
| epochs | int | [1, inf) | 1 | number of training epochs |
| batch_size | int | [1, inf) | 1 | training batch size |
| loss_fn | str | {"CrossEntropyLoss", "MSELoss", ...} | None | loss function of the top model |
| learning_rate | float | [0, inf) | 0.001 | learning rate for training |
| optimizer | str | {"SGD", "Adam", ...} | None | optimizer of the top/bottom models |
| use_mid | bool | {true, false} | true | whether to use the mid model for the vertical NN (otherwise only top/bottom models are used) |
| mid_shape_in | int | [1, inf) | 1 | input shape of the mid model, equal to the output shape of the host bottom model |
| mid_shape_out | int | [1, inf) | 1 | output shape of the mid model, equal to the input shape of the guest top model minus the output shape of the guest bottom model |
| mid_activation | str | {"linear", "Relu", ...} | "linear" | activation function of the mid model |
| privacy_mode | str | {"plain"} | "plain" | encryption mechanism for interaction between parties |
| metrics | str | {"accuracy", "", ...} | None | output metrics for model evaluation |
| predict_model | str | {"categorical", "", ...} | None | set to "categorical" only if the top model is a classification model and the prediction value should be transformed to a categorical vector |
| num_classes | int | [1, inf) | None | number of classes; only needed when the top model is a classification model and "predict_model" is "categorical" |
| client_frac | float | (0.0, 1.0] | 1.0 | (cluster-server mode) fraction of clients selected to update the global model |
| model_conf | str | any | None | Keras model config JSON string or model file path |
| train_validate_freq | int | [1, inf) | None | run validation on the validation data every train_validate_freq epochs if train_validate_freq is not None |
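
As an illustration of how the split (top/mid/bottom) model options fit together, here is a hypothetical guest-side configuration for a vertical neural network with a Keras backend. The model file paths, shape values, and class count are placeholders, and the concrete configuration layout follows example/conf/.

```python
# Hypothetical guest-side neural_network configuration (vertical split model, Keras backend).
neural_network_conf = {
    "task_type": "neural_network",
    "task_role": "guest",
    "backend": "keras",
    "format": "file",                  # load top/mid/bottom models from files
    "btm": "/path/to/bottom_model.json",
    "mid": "/path/to/mid_model.json",
    "top": "/path/to/top_model.json",
    "epochs": 10,
    "batch_size": 64,
    "loss_fn": "CrossEntropyLoss",
    "learning_rate": 0.001,
    "optimizer": "Adam",
    "use_mid": True,                   # interaction layer between bottom and top models
    "mid_shape_in": 8,                 # must equal the host bottom model's output shape
    "mid_shape_out": 8,                # guest top input shape minus guest bottom output shape
    "mid_activation": "linear",
    "privacy_mode": "plain",
    "metrics": "accuracy",
    "predict_model": "categorical",    # top model is a classifier
    "num_classes": 2,
}
```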

evaluate parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "evaluate" | "evaluate" | task type |
| task_role | str | {"guest", "host"} | "guest" | task role |

model_loader parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "model_loader" | "model_loader" | task type |
| task_role | str | {"guest", "host"} | "guest" | task role |
| model_id | str | {model_id} | None | id of model to be loaded/unloaded |
| action | str | {"load", "unload"} | None | load/unload model |
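
For example, loading a previously trained model into memory might be configured as in the sketch below; the model_id value is a placeholder.

```python
# Hypothetical model_loader configuration; the model_id value is a placeholder.
model_loader_conf = {
    "task_type": "model_loader",
    "task_role": "guest",
    "model_id": "glm_20240101_0001",   # id of a previously saved model
    "action": "load",                  # or "unload" to release it from memory
}
```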

predict_offline parameters

| Parameter | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| task_type | str | "predict_offline" | "predict_offline" | task type |
| task_role | str | {"guest", "host"} | "guest" | task role |
| model_id | str | {model_id} | None | id of model used for prediction |
| input_data_path | str | {file_path} | None | input file's path and filename |
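
Finally, an offline prediction task using a trained model could be sketched as follows; the model id and input path are placeholders.

```python
# Hypothetical predict_offline configuration; model_id and the input path are placeholders.
predict_offline_conf = {
    "task_type": "predict_offline",
    "task_role": "guest",
    "model_id": "glm_20240101_0001",
    "input_data_path": "/path/to/predict_input.csv",
}
```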