Join the Slack team and enter the CRF-as-RNN channel to discuss:
https://deep-learning-geeks-slack.herokuapp.com/
This repo aims to:

- show you how to train CRF-as-RNN on the PASCAL VOC dataset (20 classes + background)
- be a well-maintained place to communicate about this method
- try to rewrite CRF-as-RNN with Caffe2 (join the Slack team and let's discuss together)
- Single GPU: an AWS p2.xlarge instance (spot instance, ~$0.2/hour, with a Tesla K80 GPU and 12 GB of memory) will be enough for training. An equivalent setup may also work.
- Multiple GPUs: you need to make some changes to achieve this, and I haven't succeeded with it yet (tried with 3 GPUs and 18 GB of memory in total, but failed); I will update this if I find a working setup.
Please refer to this repo for all the details and commands; it may take you around half an hour to one hour.
- OpenCV (you may not need OpenCV)
- NVIDIA driver
- CUDA
- cuDNN
- Get CRF-as-RNN and check out the correct branch:

  ```shell
  git clone https://github.com/torrvision/caffe.git
  cd caffe
  git checkout crfrnn
  ```

  You will have a repo called `caffe` wherever you ran the clone command.
- Change some source code to optimize memory consumption (**IMPORTANT**)

  If you just begin to build, you will probably meet this issue, which sucks. Therefore, you need to change some Caffe source code following this PR. Details:

  - In `caffe/src/caffe/layers/base_conv_layer.cpp`, line 12: `Blob<Dtype> BaseConvolutionLayer<Dtype>::col_buffer_;`, line 13: `template <typename Dtype>`
  - In `caffe/include/caffe/layers/base_conv_layer.hpp`, line 168: `static Blob<Dtype> col_buffer_;`

  This reduces GPU memory consumption by sharing memory, but comes with a known yet ignorable bug.
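The two edits above can also be scripted. Below is a sketch using `sed` on stand-in file fragments; the real files are `caffe/src/caffe/layers/base_conv_layer.cpp` and `caffe/include/caffe/layers/base_conv_layer.hpp`, and you should verify the exact line contents against the crfrnn branch (or the PR) before applying:

```shell
# Stand-in fragments with the lines in question (assumption: the real
# files contain more context; apply the same edits at the lines above).
printf '  Blob<Dtype> col_buffer_;\n' > base_conv_layer.hpp
printf 'namespace caffe {\n' > base_conv_layer.cpp

# Header: make the im2col buffer a static member so all conv layers share it.
sed -i 's/Blob<Dtype> col_buffer_;/static Blob<Dtype> col_buffer_;/' base_conv_layer.hpp

# Source: a static data member also needs an out-of-class definition.
cat >> base_conv_layer.cpp <<'EOF'
template <typename Dtype>
Blob<Dtype> BaseConvolutionLayer<Dtype>::col_buffer_;
EOF
```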
- Configure Make
  In the root folder of `caffe`, there's a file called `Makefile.config.example`. Copy it and rename the copy to `Makefile.config`, or just run `cp Makefile.config.example Makefile.config`.
  - If you installed OpenCV separately, uncomment `USE_PKG_CONFIG := 1`.
  - If you want to train with multiple GPUs, uncomment `USE_NCCL := 1` and install NCCL.
  - If you want to use OpenBLAS, which works more efficiently with multiple CPUs than ATLAS, you probably want to change `BLAS := atlas` to `BLAS := open`, though I don't think this is necessary.
  - Comment out the `60, 61` arch options, since the machine you are using is probably not going to support them (they require a GPU with compute capability 6.0/6.1; the K80 above does not qualify).
  Then, in the root folder of `caffe`, just run `make all`. If the build goes wrong, you can run `make clean` to clean everything and re-make. The make process may take a while, around 10 minutes.
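The configuration steps above can be scripted with `sed`. The sketch below runs on a minimal stand-in `Makefile.config.example` containing only the relevant lines (the real file in the caffe root has the same lines among many others; double-check before scripting against it):

```shell
# Minimal stand-in for Makefile.config.example (relevant lines only).
cat > Makefile.config.example <<'EOF'
# USE_PKG_CONFIG := 1
# USE_NCCL := 1
BLAS := atlas
EOF

cp Makefile.config.example Makefile.config

# Uncomment USE_PKG_CONFIG if you installed OpenCV separately.
sed -i 's/^# USE_PKG_CONFIG := 1/USE_PKG_CONFIG := 1/' Makefile.config
# Uncomment USE_NCCL only for multi-GPU training (requires NCCL installed).
sed -i 's/^# USE_NCCL := 1/USE_NCCL := 1/' Makefile.config
# Optional: switch from ATLAS to OpenBLAS.
sed -i 's/^BLAS := atlas/BLAS := open/' Makefile.config
```

After editing the real file this way, run `make all` (and `make clean` first if a previous build failed).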
The whole idea is that we need to:

- Download the PASCAL VOC dataset (which is very large)
- Label it
- Create an LMDB database for Caffe to access easily

I found that this repo does a good job on these steps. Therefore, you need to:
- Clone this repo
- Prepare the data

  Follow the steps from "Prepare dataset for training" to "Create LMDB database", then stop there.

  You will have trouble executing the last step, because the script needs a very small piece of functionality to dump images into a Datum. Therefore, you have two options:

  a. In the root folder of this repo, also clone a Caffe (it can be any version) and build it just like above, except without making any source changes. Basically:

     - Clone Caffe
     - Run:

       ```shell
       cp Makefile.config.example Makefile.config
       make all
       make pycaffe  # IMPORTANT: we skipped this above because we didn't need it there, but you need it here
       ```
  b. Go to the Caffe we built above, additionally build pycaffe, and change the file path:

     - Run `make pycaffe`
     - Add the Caffe root, as in this example, after this line, using the actual Caffe relative/absolute (recommended) path
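For option b, instead of editing the script itself, you can also point Python at pycaffe through `PYTHONPATH`. A sketch, assuming the Caffe you built lives at `~/caffe` (adjust to your actual path):

```shell
# Assumption: the Caffe built above lives at ~/caffe; use your real path.
CAFFE_ROOT="$HOME/caffe"

# Make `import caffe` resolve to the pycaffe you built with `make pycaffe`.
export PYTHONPATH="$CAFFE_ROOT/python${PYTHONPATH:+:$PYTHONPATH}"
echo "$PYTHONPATH"
```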
After you've done one of the two options above, you should be able to finish all the labeling steps. Then, in the `train-CRF-RNN` folder, you will see these folders:

- train_images_20_lmdb
- train_labels_20_lmdb
- test_images_20_lmdb
- test_labels_20_lmdb
- Clone this repo in a different place:

  ```shell
  git clone https://github.com/KleinYuan/train-crfasrnn.git
  ```
- Edit `trainKit/CRFRNN_train.prototxt`

  Replace `${PATH}` on lines 7/19/31/41 with the actual absolute paths of the folders listed above.
- Edit the Makefile of this repo

  Replace `${CAFFE_PATH}` with the root path of the Caffe we built above, and replace `${TRAIN_CRF_RNN_PATH}` with the root path of this repo.
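The placeholder replacement can be done with `sed`. Below is a sketch on a stand-in prototxt line; the real file is `trainKit/CRFRNN_train.prototxt` (placeholders on lines 7/19/31/41), and the same idea applies to `${CAFFE_PATH}`/`${TRAIN_CRF_RNN_PATH}` in the Makefile. The path below is an assumption; use your own absolute path:

```shell
# Stand-in for one of the prototxt lines containing the placeholder.
printf 'source: "${PATH}/train_images_20_lmdb"\n' > CRFRNN_train.prototxt

# Assumption: the LMDB folders live under this absolute path on your machine.
DATA_ROOT=/home/ubuntu/train-CRF-RNN
sed -i 's|${PATH}|'"$DATA_ROOT"'|g' CRFRNN_train.prototxt
cat CRFRNN_train.prototxt
```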
If you check the Makefile, you will see it offers four targets:

- Train-single-gpu-from-0
- Train-multiple-gpus-from-0
- Train-single-gpu-fine-tuning
- Train-multiple-gpus-fine-tuning

If you want to train a model from scratch, you need to download the FCN-8s model and put it in the folder containing the Makefile by just running:

```shell
wget http://dl.caffe.berkeleyvision.org/fcn-8s-pascal.caffemodel
```

If you want to train a model based on the pre-trained model, you need to download TVG_CRFRNN_COCO_VOC.caffemodel (be aware of this model's LICENSE; it is not free for commercial usage):

```shell
wget http://goo.gl/j7PrPZ -O TVG_CRFRNN_COCO_VOC.caffemodel
```
Finally, you can train the model for your purpose with the Makefile by running one of the following commands:

```shell
make Train-single-gpu-from-0
# or
make Train-multiple-gpus-from-0
# or
make Train-single-gpu-fine-tuning
# or
make Train-multiple-gpus-fine-tuning
```

Also, if you want to train with multiple GPUs and have more than two, just keep appending `2, 3, 4...` to the `0, 1` after the `-gpu` flag.
Multi-GPU training with Caffe is very picky about the environment. These dependencies/changes are necessary to achieve it:

- NCCL, which depends on your CUDA version; be aware of which branch you check out
- In the Caffe Makefile.config, uncomment `USE_NCCL := 1`
Potential problems you may meet:

- `Check failed: error == cudaSuccess (2 vs. 0) out of memory` → it means what it says: your GPU memory is not enough
- `error == cudaSuccess (77 vs. 0) an illegal memory access was encountered` → either a shape is not correct or your CUDA version is not correct; check here
- Stuck at `Iteration 0` for a long time → it's normal, just chill and drink coffee