use nmtwizard/opennmt-tf image to train a Transformer with GPU
Mercy811 opened this issue · 6 comments
overview
AIM: use the nmtwizard/opennmt-tf image to train a Transformer with GPU
PROBLEM: the container only runs for a few seconds and then stops automatically
- is there any way to observe the running processes that are using the GPU? the output of nvidia-smi always looks the same (screenshot omitted; see the commands after this list)
- with the latest NVIDIA container toolkit, docker run --gpus all is equivalent to nvidia-docker run, so that does not seem to be the problem
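For watching GPU usage, a couple of generic commands help (nothing here is specific to nmt-wizard; the nvidia/cuda tag below is just an example image):

# refresh the full nvidia-smi view every second
watch -n 1 nvidia-smi

# or list only compute processes with their memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# sanity check that Docker can see the GPU at all
docker run --gpus all --rm nvidia/cuda:10.0-base nvidia-smi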
details
local machine directory structure:
/data-1/xyye
├── corpus
│   ├── test-infoq
│   │   ├── train
│   │   │   ├── src-train-infoq.en
│   │   │   └── tgt-train-infoq.zh
│   │   └── vocab
│   │       ├── src-vocab-infoq.txt
│   │       └── tgt-vocab-infoq.txt
│   └── toy-ende
├── config
│   └── run-config.json
├── models
└── workspace
run-config.json
the content in options is written according to the OpenNMT-tf documentation:
"source": "en",
"target": “zh",
"data": {
},
"tokenization": {
},
"options": {
"mode_type": “Transformer”
"data": {
"train_features_file": "${CORPUS_DIR}/train/src-train-infoq.en",
"train_labels_file": "${CORPUS_DIR}/train/tgt-train-infoq.zh",
"source_vocabulary": "${CORPUS_DIR}/vocab/src-vocab-infoq.txt",
"target_vocabulary": "${CORPUS_DIR}/vocab/tgt-vocab-infoq.txt"
}
"config": {
"params": {
"optimizer": "GradientDescentOptimizer",
"learning_rate": 1,
"param_init": 0.1,
"clip_gradients": 5.0,
"beam_width": 5
},
"train": {
"batch_size": 64,
"bucket_width": 2,
"maximum_features_length": 50,
"maximum_labels_length": 50,
"save_checkpoints_steps": 5000,
"keep_checkpoint_max": 8
}
}
}
}
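A generic precaution before piping this file into the container is to check that it is well-formed JSON:

python -m json.tool run-config.json && echo "config is valid JSON"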
command line
cat run-config.json | docker run --gpus all -a STDIN -i --rm \
-e MODELS_DIR \
-e WORKSPACE_DIR \
-e CORPUS_DIR \
-v $CORPUS_DIR/test-infoq:/root/corpus \
-v $MODELS_DIR:/root/models \
nmtwizard/opennmt-tf \
-c - -ms /root/models -g 1 train
I've already set persistent environment variables in ~/.bashrc:
export CORPUS_DIR='/data-1/xyye/corpus'
export MODELS_DIR='/data-1/xyye/models'
export WORKSPACE_DIR='/data-1/xyye/workspace'
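To double-check that the variables resolve to the intended host paths before starting the container:

echo $CORPUS_DIR $MODELS_DIR $WORKSPACE_DIR
# should list src-train-infoq.en and tgt-train-infoq.zh
ls $CORPUS_DIR/test-infoq/train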
Most of your setup looks correct, except the data configuration, which should not go in the "options" section, as the data is prepared outside of OpenNMT-tf.
Try something like:
{
"source": "en",
"target": "zh",
"tokenization": {
"source": {"vocabulary": "${CORPUS_DIR}/vocab/src-vocab-infoq.txt"},
"target": {"vocabulary": "${CORPUS_DIR}/vocab/tgt-vocab-infoq.txt"}
},
"options": {
"mode_type": "Transformer",
"config": {
"params": {
"optimizer": "GradientDescentOptimizer",
"learning_rate": 1,
"param_init": 0.1,
"clip_gradients": 5.0,
"beam_width": 5
},
"train": {
"batch_size": 64,
"bucket_width": 2,
"maximum_features_length": 50,
"maximum_labels_length": 50,
"save_checkpoints_steps": 5000,
"keep_checkpoint_max": 8
}
}
}
}
Thanks for your help! But there are still some problems.
what I changed
to make it simple, I made some small changes to the local directory structure:
/data-1/xyye
├── corpus
│   ├── train
│   │   ├── src-train-infoq.en
│   │   └── tgt-train-infoq.zh
│   └── vocab
│       ├── src-vocab-infoq.txt
│       └── tgt-vocab-infoq.txt
├── config
│   └── run-config.json
├── models
└── workspace
command line:
- mount the workspace here
- use --storage_config
cat run-config.json | docker run --name=logtest --gpus all -a STDIN -i \
-e MODELS_DIR \
-e WORKSPACE_DIR \
-e CORPUS_DIR \
-v $CORPUS_DIR:/root/corpus \
-v $MODELS_DIR:/root/models \
-v $WORKSPACE_DIR:/root/workspace \
-v /data-1/xyye/config:/root/config \
nmtwizard/opennmt-tf \
-c - -ms /root/models -g 1 --storage_config /root/config/storage-config.json train
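Since the container is started with --name=logtest and without --rm, its output and exit status survive the exit and can be inspected afterwards:

docker logs logtest
docker inspect logtest --format '{{.State.ExitCode}}'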
run-config.json
updated as you suggested above:
{
"source": "en",
"target": "zh",
"tokenization": {
"source": {"vocabulary": "${CORPUS_DIR}/vocab/src-vocab-infoq.txt"},
"target": {"vocabulary": "${CORPUS_DIR}/vocab/tgt-vocab-infoq.txt"}
},
"options": {
"mode_type": "Transformer",
"config": {
"params": {
"optimizer": "GradientDescentOptimizer",
"learning_rate": 1,
"param_init": 0.1,
"clip_gradients": 5.0,
"beam_width": 5
},
"train": {
"batch_size": 64,
"bucket_width": 2,
"maximum_features_length": 50,
"maximum_labels_length": 50,
"save_checkpoints_steps": 5000,
"keep_checkpoint_max": 8
}
}
}
}
got the following error message:
2019-10-11 07:11:01Z.000607 [beat_service] WARNING start_beat_service: CALLBACK_URL or task_id is unset; beat service will be disabled
2019-10-11 07:11:01Z.000607 [utility] INFO run: Starting executing utility NMT framework=?
2019-10-11 07:11:01Z.000607 [framework] INFO train_wrapper: Starting training model 884c6a73-d7af-4587-96a8-4d4ccc78bd92
2019-10-11 07:11:01Z.000607 [preprocess] WARNING generate_preprocessed_data: No 'data' field in configuration, default corpus directory and all corpora are used.)
2019-10-11 07:11:01Z.000608 [framework] INFO _merge_multi_training_files: Merging training data to /data-1/xyye/workspace/data/merged/train/train.{en,zh}
Traceback (most recent call last):
File "entrypoint.py", line 241, in <module>
OpenNMTTFFramework().run()
File "/root/nmtwizard/utility.py", line 203, in run
stats = self.exec_function(args)
File "/root/nmtwizard/framework.py", line 296, in exec_function
push_model=not self._no_push)
File "/root/nmtwizard/framework.py", line 382, in train_wrapper
self._build_data(local_config))
File "/root/nmtwizard/framework.py", line 874, in _build_data
data_dir, train_dir, config['source'], config['target'])
File "/root/nmtwizard/framework.py", line 884, in _merge_multi_training_files
data_util.merge_files_in_directory(data_path, merged_path, source, target)
File "/root/nmtwizard/data.py", line 20, in merge_files_in_directory
files = [f for f in os.listdir(input_dir) if os.path.isfile(os.path.join(input_dir, f))]
OSError: [Errno 2] No such file or directory: '/data-1/xyye/corpus/train'
questions
I traced this error in the source code but didn't figure it out, and even came up with a new question:
- the value of input_dir in

  File "/root/nmtwizard/data.py", line 20, in merge_files_in_directory
    files = [f for f in os.listdir(input_dir) if os.path.isfile(os.path.join(input_dir, f))]

  was set in preprocess.generate_preprocessed_data. I am wondering why train_dir = config['data']['train_dir'] if there is a data section in the run configuration, because according to the documentation (Configuration / Training data sampling), the data section has only two elements: path and distribution (see the sketch after these questions):

  train_dir = 'train'
  if 'data' in config:
      if 'train_dir' in config['data']:
          train_dir = config['data']['train_dir']
- although the container still ran for a few seconds and terminated itself, both /data-1/xyye/workspace and /data-1/xyye/models ended up empty, and at least one of them should not be. Is there any way to save the training result for serving or inference later?
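For reference on the first question: from that "Training data sampling" documentation section, I would have expected the data section to be shaped roughly like this (the sample_dist wrapper and the pattern/weight format are my reading of the docs, not verified against the code):

"data": {
    "sample_dist": [
        {
            "path": "train",
            "distribution": [["*", 1]]
        }
    ]
}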
The following environment variables should not be passed to the Docker container. They should only be set when you want to change the mounted directory paths inside the container (e.g. /root/corpus, /root/models, etc.):
-e MODELS_DIR \
-e WORKSPACE_DIR \
-e CORPUS_DIR \
Also, you most likely don't need to set the storage configuration and the -ms option (which defaults to /root/models).
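Putting these suggestions together, the command would reduce to something like this (same mounts, no -e flags, defaults for the rest):

cat run-config.json | docker run --gpus all -a STDIN -i --rm \
    -v $CORPUS_DIR:/root/corpus \
    -v $MODELS_DIR:/root/models \
    -v $WORKSPACE_DIR:/root/workspace \
    nmtwizard/opennmt-tf \
    -c - -g 1 train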
Regarding your questions:
- The code you are referring to also implements backward compatibility with some old configurations.
- The trained model will be saved to /root/models, which you configured to be /data-1/xyye/models on the host.
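So after a successful training run, you can check the result directly from the host:

# the trained model package should appear here
ls /data-1/xyye/models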
Still, I can't run nmtwizard/opennmt-tf. But I managed to run tensorflow:1.13.1-gpu-py3 directly: I went inside the container and followed the OpenNMT-tf quickstart. The GPU was used as well, and it worked perfectly. It seems this is a better approach than nmtwizard/opennmt-tf.
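Roughly what I did, for anyone hitting the same issue (commands from memory; the onmt-main flags follow the OpenNMT-tf 1.x quickstart, and the config file name is illustrative):

docker run --gpus all -it --rm \
    -v /data-1/xyye/corpus:/root/corpus \
    tensorflow/tensorflow:1.13.1-gpu-py3 bash

# inside the container:
pip install OpenNMT-tf
onmt-main train_and_eval --model_type Transformer \
    --config config.yml --auto_config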
Sure, it's easier to use OpenNMT-tf directly for most use cases.
nmt-wizard-docker offers some additional features for model production: ensuring preprocessing consistency, chaining training and translation tasks, a direct path to model serving, support for remote filesystems, etc.