use nmtwizard/opennmt-tf image to train a Transformer with GPU
Mercy811 opened this issue · 6 comments
overview
AIM: use the nmtwizard/opennmt-tf image to train a Transformer with GPU
PROBLEM: the container only runs for a few seconds and then stops automatically
- is there any way to observe the running processes that are using the GPU? the output of nvidia-smi always looks the same (screenshot omitted; see the commands after this list)
- with the latest NVIDIA container toolkit, docker run --gpus all is equivalent to nvidia-docker run, so that does not seem to be the problem
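For watching GPU usage, a couple of generic commands help (nothing here is specific to nmt-wizard; the nvidia/cuda tag below is just an example image):

# refresh the full nvidia-smi view every second
watch -n 1 nvidia-smi

# or list only compute processes with their memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# sanity check that Docker can see the GPU at all
docker run --gpus all --rm nvidia/cuda:10.0-base nvidia-smi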
details
local machine directory structure:
/data-1/xyye
├── corpus
│   ├── test-infoq
│   │   ├── train
│   │   │   ├── src-train-infoq.en
│   │   │   └── tgt-train-infoq.zh
│   │   └── vocab
│   │       ├── src-vocab-infoq.txt
│   │       └── tgt-vocab-infoq.txt
│   └── toy-ende
├── config
│   └── run-config.json
├── models
└── workspace
run-config.json
the content in options is written according to the OpenNMT-tf documentation:
"source": "en",
"target": “zh",
"data": {
},
"tokenization": {
},
"options": {
"mode_type": “Transformer”
"data": {
"train_features_file": "${CORPUS_DIR}/train/src-train-infoq.en",
"train_labels_file": "${CORPUS_DIR}/train/tgt-train-infoq.zh",
"source_vocabulary": "${CORPUS_DIR}/vocab/src-vocab-infoq.txt",
"target_vocabulary": "${CORPUS_DIR}/vocab/tgt-vocab-infoq.txt"
}
"config": {
"params": {
"optimizer": "GradientDescentOptimizer",
"learning_rate": 1,
"param_init": 0.1,
"clip_gradients": 5.0,
"beam_width": 5
},
"train": {
"batch_size": 64,
"bucket_width": 2,
"maximum_features_length": 50,
"maximum_labels_length": 50,
"save_checkpoints_steps": 5000,
"keep_checkpoint_max": 8
}
}
}
}
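A generic precaution before piping this file into the container is to check that it is well-formed JSON:

python -m json.tool run-config.json && echo "config is valid JSON"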
command line
cat run-config.json | docker run --gpus all -a STDIN -i --rm \
-e MODELS_DIR \
-e WORKSPACE_DIR \
-e CORPUS_DIR \
-v $CORPUS_DIR/test-infoq:/root/corpus \
-v $MODELS_DIR:/root/models \
nmtwizard/opennmt-tf \
-c - -ms /root/models -g 1 train
I've already set persistent environment variables in ~/.bashrc:
export CORPUS_DIR='/data-1/xyye/corpus'
export MODELS_DIR='/data-1/xyye/models'
export WORKSPACE_DIR='/data-1/xyye/workspace'
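To double-check that the variables resolve to the intended host paths before starting the container:

echo $CORPUS_DIR $MODELS_DIR $WORKSPACE_DIR
# should list src-train-infoq.en and tgt-train-infoq.zh
ls $CORPUS_DIR/test-infoq/train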
Most of your setup looks correct, except the data configuration, which should not go in the "options" section, as the data is prepared outside of OpenNMT-tf.
Try something like:
{
"source": "en",
"target": "zh",
"tokenization": {
"source": {"vocabulary": "${CORPUS_DIR}/vocab/src-vocab-infoq.txt"},
"target": {"vocabulary": "${CORPUS_DIR}/vocab/tgt-vocab-infoq.txt"}
},
"options": {
"mode_type": "Transformer",
"config": {
"params": {
"optimizer": "GradientDescentOptimizer",
"learning_rate": 1,
"param_init": 0.1,
"clip_gradients": 5.0,
"beam_width": 5
},
"train": {
"batch_size": 64,
"bucket_width": 2,
"maximum_features_length": 50,
"maximum_labels_length": 50,
"save_checkpoints_steps": 5000,
"keep_checkpoint_max": 8
}
}
}
}
Thanks for your help! But there are still some problems.
what I changed
to make it simple, I made some small changes to the local directory structure:
/data-1/xyye
├── corpus
│   ├── train
│   │   ├── src-train-infoq.en
│   │   └── tgt-train-infoq.zh
│   └── vocab
│       ├── src-vocab-infoq.txt
│       └── tgt-vocab-infoq.txt
├── config
│   └── run-config.json
├── models
└── workspace
command line:
- mount the workspace here
- use --storage_config
cat run-config.json | docker run --name=logtest --gpus all -a STDIN -i \
-e MODELS_DIR \
-e WORKSPACE_DIR \
-e CORPUS_DIR \
-v $CORPUS_DIR:/root/corpus \
-v $MODELS_DIR:/root/models \
-v $WORKSPACE_DIR:/root/workspace \
-v /data-1/xyye/config:/root/config \
nmtwizard/opennmt-tf \
-c - -ms /root/models -g 1 --storage_config /root/config/storage-config.json train
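Since the container is started with --name=logtest and without --rm, its output and exit status survive the exit and can be inspected afterwards:

docker logs logtest
docker inspect logtest --format '{{.State.ExitCode}}'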
run-config.json
updated as you suggested above:
{
"source": "en",
"target": "zh",
"tokenization": {
"source": {"vocabulary": "${CORPUS_DIR}/vocab/src-vocab-infoq.txt"},
"target": {"vocabulary": "${CORPUS_DIR}/vocab/tgt-vocab-infoq.txt"}
},
"options": {
"mode_type": "Transformer",
"config": {
"params": {
"optimizer": "GradientDescentOptimizer",
"learning_rate": 1,
"param_init": 0.1,
"clip_gradients": 5.0,
"beam_width": 5
},
"train": {
"batch_size": 64,
"bucket_width": 2,
"maximum_features_length": 50,
"maximum_labels_length": 50,
"save_checkpoints_steps": 5000,
"keep_checkpoint_max": 8
}
}
}
}
got the following error message:
2019-10-11 07:11:01Z.000607 [beat_service] WARNING start_beat_service: CALLBACK_URL or task_id is unset; beat service will be disabled
2019-10-11 07:11:01Z.000607 [utility] INFO run: Starting executing utility NMT framework=?
2019-10-11 07:11:01Z.000607 [framework] INFO train_wrapper: Starting training model 884c6a73-d7af-4587-96a8-4d4ccc78bd92
2019-10-11 07:11:01Z.000607 [preprocess] WARNING generate_preprocessed_data: No 'data' field in configuration, default corpus directory and all corpora are used.)
2019-10-11 07:11:01Z.000608 [framework] INFO _merge_multi_training_files: Merging training data to /data-1/xyye/workspace/data/merged/train/train.{en,zh}
Traceback (most recent call last):
File "entrypoint.py", line 241, in <module>
OpenNMTTFFramework().run()
File "/root/nmtwizard/utility.py", line 203, in run
stats = self.exec_function(args)
File "/root/nmtwizard/framework.py", line 296, in exec_function
push_model=not self._no_push)
File "/root/nmtwizard/framework.py", line 382, in train_wrapper
self._build_data(local_config))
File "/root/nmtwizard/framework.py", line 874, in _build_data
data_dir, train_dir, config['source'], config['target'])
File "/root/nmtwizard/framework.py", line 884, in _merge_multi_training_files
data_util.merge_files_in_directory(data_path, merged_path, source, target)
File "/root/nmtwizard/data.py", line 20, in merge_files_in_directory
files = [f for f in os.listdir(input_dir) if os.path.isfile(os.path.join(input_dir, f))]
OSError: [Errno 2] No such file or directory: '/data-1/xyye/corpus/train'
questions
I traced this error in the source code but didn't figure it out, and even came up with a new question:
- the value of input_dir in

  File "/root/nmtwizard/data.py", line 20, in merge_files_in_directory
    files = [f for f in os.listdir(input_dir) if os.path.isfile(os.path.join(input_dir, f))]

  was set in preprocess.generate_preprocessed_data. I am wondering why train_dir = config['data']['train_dir'] if there is a data section in the run configuration, because according to the documentation (Configuration / Training data sampling), the data section has only two elements: path and distribution (see the sketch after these questions):

  train_dir = 'train'
  if 'data' in config:
      if 'train_dir' in config['data']:
          train_dir = config['data']['train_dir']
- although the container still ran for a few seconds and terminated itself, both /data-1/xyye/workspace and /data-1/xyye/models ended up empty, and at least one of them should not be. Is there any way to save the training result for serving or inference later?
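For reference on the first question: from that "Training data sampling" documentation section, I would have expected the data section to be shaped roughly like this (the sample_dist wrapper and the pattern/weight format are my reading of the docs, not verified against the code):

"data": {
    "sample_dist": [
        {
            "path": "train",
            "distribution": [["*", 1]]
        }
    ]
}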
The following environment variables should not be passed to the Docker container. They should only be set when you want to change the mounted directory paths inside the container (e.g. /root/corpus, /root/models, etc.):
-e MODELS_DIR \
-e WORKSPACE_DIR \
-e CORPUS_DIR \
Also, you most likely don't need to set the storage configuration and the -ms option (which defaults to /root/models).
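Putting these suggestions together, the command would reduce to something like this (same mounts, no -e flags, defaults for the rest):

cat run-config.json | docker run --gpus all -a STDIN -i --rm \
    -v $CORPUS_DIR:/root/corpus \
    -v $MODELS_DIR:/root/models \
    -v $WORKSPACE_DIR:/root/workspace \
    nmtwizard/opennmt-tf \
    -c - -g 1 train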
Regarding your questions:
- The code you are referring to also implements backward compatibility with some old configurations.
- The trained model will be saved to /root/models, which you configured to be /data-1/xyye/models on the host.
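So after a successful training run, you can check the result directly from the host:

# the trained model package should appear here
ls /data-1/xyye/models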
Still, I can't run nmtwizard/opennmt-tf. But I managed to run tensorflow:1.13.1-gpu-py3 directly: I went inside the container and followed the OpenNMT-tf quickstart. The GPU was used as well, and it worked perfectly. It seems this is a better approach than nmtwizard/opennmt-tf.
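Roughly what I did, for anyone hitting the same issue (commands from memory; the onmt-main flags follow the OpenNMT-tf 1.x quickstart, and the config file name is illustrative):

docker run --gpus all -it --rm \
    -v /data-1/xyye/corpus:/root/corpus \
    tensorflow/tensorflow:1.13.1-gpu-py3 bash

# inside the container:
pip install OpenNMT-tf
onmt-main train_and_eval --model_type Transformer \
    --config config.yml --auto_config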
Sure, it's easier to use OpenNMT-tf directly for most use cases.
nmt-wizard-docker offers some additional features for model production: ensuring preprocessing consistency, chaining training and translation tasks, a direct path to model serving, support for remote filesystems, etc.