Cluster implementation of "Learning to See in the Dark"

source code: https://github.com/cchen156/Learning-to-See-in-the-Dark

Install TensorFlowOnSpark

Run pip install tensorflow tensorflowonspark on all the machines (Dom0, VM1 to VM8).
Add the following lines to the /etc/profile file:
export QUEUE=default
export LIB_HDFS=$HADOOP_HOME/lib/native
export LIB_JVM=$JAVA_HOME/jre/lib/amd64/server
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7
export LD_LIBRARY_PATH=${PATH}
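
Before submitting jobs, it can help to confirm that each machine has picked up these variables. The following is a minimal sketch and is not part of the repository: the helper name `missing_env_vars` is hypothetical, and the list of names is taken from the exports above plus HADOOP_HOME and JAVA_HOME, which those exports reference.

```python
import os

# Variables exported in /etc/profile above, plus the two they depend on.
REQUIRED_VARS = [
    "HADOOP_HOME", "JAVA_HOME",
    "QUEUE", "LIB_HDFS", "LIB_JVM", "SPARK_HOME", "LD_LIBRARY_PATH",
]

def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Running this in a fresh login shell on each machine shows whether /etc/profile was sourced as expected.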

Run training

Test run (6 images, 10 epochs, batch size 2). Input directory with the test dataset is hdfs://gpu10:9000/Sony_pickle_test/, model output is hdfs://gpu10:9000/Sony_model_test.
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 15 \
--driver-memory 3G \
--executor-memory 3G \
--py-files /home/hduser/see-in-the-dark/train_Sony.py,/home/hduser/see-in-the-dark/inference_Sony.py,/home/hduser/see-in-the-dark/inference_Sony_our.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
--conf spark.driver.memory=3G \
--conf spark.executor.memory=3G \
--conf spark.driver.maxResultSize=2G \
--conf spark.executor.cores=1 \
--conf spark.task.cpus=1 \
/home/hduser/see-in-the-dark/script.py \
--batch_size 2 \
--steps 30 \
--model hdfs://gpu10:9000/Sony_model_test \
--input-dir hdfs://gpu10:9000/Sony_pickle_test/image_data \
--gt-dir hdfs://gpu10:9000/Sony_pickle_test/gt_data
To run in client mode instead, use the following lines in place of the corresponding cluster-mode settings:
--deploy-mode client \
--driver-memory 1G \
--conf spark.yarn.am.memory=1G \
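
The --steps 30 value in the test command is consistent with the run description (6 images, 10 epochs, batch size 2) if one step is one batch and one epoch is one pass over the dataset. That relationship is an assumption inferred from the numbers, not something confirmed by script.py, and the function name below is hypothetical.

```python
import math

def total_steps(num_images, batch_size, epochs):
    """Batches per epoch (rounded up) times the number of epochs."""
    if num_images <= 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(num_images / batch_size) * epochs

# Test run from this README: 6 images, batch size 2, 10 epochs.
print(total_steps(6, 2, 10))  # 30
```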
Full dataset. Input directory with the full dataset is hdfs://gpu10:9000/Sony_pickle/, model output is hdfs://gpu10:9000/Sony_model.
${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 15 \
--driver-memory 3G \
--executor-memory 3G \
--py-files /home/hduser/see-in-the-dark/train_Sony.py,/home/hduser/see-in-the-dark/inference_Sony.py,/home/hduser/see-in-the-dark/inference_Sony_our.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
--conf spark.driver.memory=3G \
--conf spark.executor.memory=3G \
--conf spark.driver.maxResultSize=2G \
--conf spark.executor.cores=1 \
--conf spark.task.cpus=1 \
/home/hduser/see-in-the-dark/script.py
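
The two training commands share every Spark option and differ only in the script arguments, while client mode changes a few settings. As an illustration, the sketch below assembles the same argument list in Python. `build_submit_args` is a hypothetical helper, not part of the repository. It assumes that client mode sets the driver memory to 1G and adds spark.yarn.am.memory=1G, leaving everything else unchanged. It also sets driver and executor memory once through --driver-memory and --executor-memory instead of repeating them as --conf entries, which is a simplification made here.

```python
import os

PROJECT = "/home/hduser/see-in-the-dark"
PY_FILES = ["train_Sony.py", "inference_Sony.py", "inference_Sony_our.py"]

def build_submit_args(deploy_mode="cluster", script_args=()):
    """Assemble the spark-submit argument list used in this README.

    Assumption: client mode uses a 1G driver and adds
    spark.yarn.am.memory=1G; all other options stay the same.
    """
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    client = deploy_mode == "client"
    conf = {
        "spark.dynamicAllocation.enabled": "false",
        "spark.yarn.maxAppAttempts": "1",
        # Expanded in Python because no shell is involved with subprocess.
        "spark.executorEnv.LD_LIBRARY_PATH":
            os.path.expandvars("$LIB_JVM:$LIB_HDFS"),
        "spark.driver.maxResultSize": "2G",
        "spark.executor.cores": "1",
        "spark.task.cpus": "1",
    }
    if client:
        conf["spark.yarn.am.memory"] = "1G"
    args = [
        os.path.expandvars("${SPARK_HOME}/bin/spark-submit"),
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--num-executors", "15",
        "--driver-memory", "1G" if client else "3G",
        "--executor-memory", "3G",
        "--py-files", ",".join(f"{PROJECT}/{name}" for name in PY_FILES),
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(f"{PROJECT}/script.py")
    args.extend(script_args)
    return args

# Test run in client mode, using the script arguments shown earlier.
test_run = build_submit_args("client", [
    "--batch_size", "2", "--steps", "30",
    "--model", "hdfs://gpu10:9000/Sony_model_test",
    "--input-dir", "hdfs://gpu10:9000/Sony_pickle_test/image_data",
    "--gt-dir", "hdfs://gpu10:9000/Sony_pickle_test/gt_data",
])
print(" ".join(test_run))
```

On a machine where spark-submit is available, the list could be passed to subprocess.run(test_run, check=True).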

Run inference

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 15 \
--driver-memory 3G \
--executor-memory 3G \
--py-files /tmp/pycharm_rustam/train_Sony.py,/tmp/pycharm_rustam/inference_Sony.py,/tmp/pycharm_rustam/inference_Sony_our.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
--conf spark.driver.memory=3G \
--conf spark.executor.memory=3G \
--conf spark.driver.maxResultSize=2G \
--conf spark.executor.cores=1 \
--conf spark.task.cpus=1 \
/tmp/pycharm_rustam/script.py \
--mode inference \
--steps 1 \
--model hdfs://gpu10:9000/Sony_model \
--inference our \
--inputfile hdfs://gpu10:9000/predict_images/20005_01_0.1s.ARW20190418-150337.pkl \
--outputfile testResult.pkl
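
The inference job writes its result to the file named by --outputfile. Assuming that file is an ordinary Python pickle (the .pkl extension suggests so, but its contents are not documented here), a quick way to see what it holds might look like the sketch below. `describe_pickle` is a hypothetical helper, and the demonstration uses a stand-in file, not real output. If the result contains NumPy arrays, NumPy must be installed to unpickle it.

```python
import os
import pickle
import tempfile

def describe_pickle(path):
    """Load a pickle file and return (type name, shape or length or None)."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    size = getattr(obj, "shape", None)
    if size is None and hasattr(obj, "__len__"):
        size = len(obj)
    return type(obj).__name__, size

# Demonstration with a stand-in file; replace the path with testResult.pkl.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as handle:
        pickle.dump([[0.0, 0.5], [1.0, 0.25]], handle)
    print(describe_pickle(demo_path))  # ('list', 2)
```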

Run server

To start the Flask application, run the following commands:
cd flask_app
source flaskapp/bin/activate
export FLASK_APP=flask_app.py
flask run --host=0.0.0.0 --port=6000
Connect to the vpn.cs.hku.hk VPN, then open http://202.45.128.135:22610/ in a browser.
Upload an ARW image to the cluster via the web application. Uploading and processing an image might take 2-4 minutes, depending on file size and network speed.
