A lightweight framework for distributed machine learning training, built on Rabit for the communication layer. The TensorFlow optimizer wrapper borrows concepts from Horovod.
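To illustrate the idea, here is a minimal sketch of how a Horovod-style optimizer wrapper is typically used. The module name tf_collective_all_reduce and the DistributedOptimizer wrapper below are assumptions for illustration, not necessarily this library's exact API.

# Hypothetical sketch of the Horovod-style wrapper pattern; names are assumptions.
import tensorflow as tf
import tf_collective_all_reduce as tfc  # assumed module name

# Toy model: fit y = 2x with a single weight.
x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])
w = tf.get_variable("w", shape=[1, 1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Wrap a stock optimizer so that gradients are all-reduced (averaged) across
# workers over Rabit before being applied, following the Horovod pattern.
opt = tfc.DistributedOptimizer(tf.train.AdamOptimizer(0.01))  # assumed wrapper name
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)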
git clone https://github.com/criteo/tf-collective-all-reduce
python3.6 -m venv tf_env
. tf_env/bin/activate
pip install tensorflow==1.12.2
pushd tf-collective-all-reduce
./install.sh
pip install -e .
popd
Note that tf-collective-all-reduce only supports Python ≥ 3.6.
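A quick way to sanity-check the installation is to import both packages from the activated virtualenv; the importable module name below is an assumption inferred from the repository name and may differ.

# Quick post-install sanity check; module name is an assumption.
import tensorflow as tf
import tf_collective_all_reduce  # assumed importable name

print(tf.__version__)  # expected: 1.12.2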
To run the tests:
pip install -r tests-requirements.txt
pytest -s
Local run with dmlc-submit
../dmlc-core/tracker/dmlc-submit --cluster local --num-workers 2 python examples/simple/simple_allreduce.py
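As background, an all-reduce example generally has each worker contribute a local tensor and receive the element-wise reduction over all workers started by dmlc-submit. The sketch below is a hypothetical illustration (the allreduce op and module name are assumptions), not the contents of examples/simple/simple_allreduce.py.

# Hypothetical all-reduce sketch; op and module names are assumptions.
import numpy as np
import tensorflow as tf
import tf_collective_all_reduce as tfc  # assumed module name

# Each worker contributes its own tensor; the all-reduce is assumed to return
# the element-wise sum across all workers launched by dmlc-submit.
local_value = tf.constant(np.ones(3, dtype=np.float32))
reduced = tfc.allreduce(local_value)  # assumed op name

with tf.Session() as sess:
    print(sess.run(reduced))  # with --num-workers 2, expect [2. 2. 2.]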
Run on a Hadoop cluster with tf-yarn
Run collective_all_reduce_example
cd examples/tf-yarn
python collective_all_reduce_example.py
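For orientation, a tf-yarn launch generally wraps the training code in an experiment function and declares the YARN containers to allocate. The sketch below is hypothetical: it does not reproduce collective_all_reduce_example.py, the run_on_yarn arguments and TaskSpec fields vary across tf-yarn versions, and the HDFS path to the packaged environment is a placeholder.

# Hedged sketch of a tf-yarn launch; argument names differ across versions
# and this is not the actual example script.
from tf_yarn import run_on_yarn, TaskSpec


def experiment_fn():
    # Build and return the training job (estimator, train/eval specs, ...) here.
    ...


run_on_yarn(
    pyenv_zip_path="hdfs:///user/me/envs/tf_env.pex",  # hypothetical environment archive
    experiment_fn=experiment_fn,
    task_specs={
        "worker": TaskSpec(memory=4 * 1024, vcores=4, instances=2),  # 4 GiB, 2 workers
    },
)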