Error while running training bash command

Question

Closed this issue a year ago · 7 comments

Hi, I am getting the following error while running the training bash command.
The command I am running is:
bash scripts/wikitabletext/train_had.sh data/wikitabletext bart.base/

custom_train.py: error: unrecognized arguments: data/wikitabletext/bins --source-lang text --target-lang data --truncate-source
ls: cannot access 'checkpoints//pt': No such file or directory
ls: cannot access 'checkpoints//checkpointbest_*pt': No such file or directory

2023-06-28 12:50:08.829246: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 12:50:09.695034: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-06-28 12:50:11 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
usage: average_checkpoints.py [-h] --inputs INPUTS [INPUTS ...] --output FILE
[--num-epoch-checkpoints NUM_EPOCH_CHECKPOINTS | --num-update-checkpoints NUM_UPDATE_CHECKPOINTS]
[--checkpoint-upper-bound CHECKPOINT_UPPER_BOUND]
average_checkpoints.py: error: argument --inputs: expected at least one argument

Answer 1 · 2023-06-28T12:58:39.000Z

Hi, could you confirm that your fairseq version is v0.10.2? Could you also check whether the file checkpoints/log contains any further error output and, if it does, post it here? Thanks!

Answer 2 · 2023-06-28T14:48:49.000Z

My fairseq version is 0.10.2.

I checked checkpoints/log, and this is what it contains:

Traceback (most recent call last):
  File "custom_train.py", line 463, in <module>
    cli_main()
  File "custom_train.py", line 459, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/distributed/utils.py", line 335, in call_main
    if cfg.distributed_training.distributed_init_method is None:
AttributeError: 'Namespace' object has no attribute 'distributed_training'

Answer 3 · 2023-06-28T15:14:56.000Z

It seems the argument parsing is incorrect. Since I import the parsing function directly from fairseq, this should not happen when the environment is correct.

Could you try re-installing fairseq==0.10.2? Alternatively, download the fairseq source code, check out v0.10.2, and use that source tree directly rather than your installed package.

Also, I suspect your environment is somehow broken, because fairseq v0.10.2 does not contain the file distributed/utils.py that appears in your log. That file is post v0.12; v0.10.2 only has distributed_utils.py.
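
One way to check which fairseq installation the interpreter is actually picking up is sketched below. It relies only on the file-layout difference described above (distributed_utils.py versus distributed/utils.py). The helper name `detect_fairseq_layout` and the returned labels are made up for illustration and are not part of fairseq.

```python
from pathlib import Path
from typing import Optional


def detect_fairseq_layout(package_dir: Path) -> Optional[str]:
    """Guess which fairseq generation a package directory belongs to.

    Heuristic taken from this thread only: v0.10.2 ships
    distributed_utils.py, newer releases ship distributed/utils.py.
    The returned labels are illustrative, not official names.
    """
    if (package_dir / "distributed" / "utils.py").is_file():
        return "newer"
    if (package_dir / "distributed_utils.py").is_file():
        return "v0.10.x-style"
    return None


if __name__ == "__main__":
    try:
        import fairseq
    except ImportError:
        print("fairseq is not importable in this interpreter")
    else:
        pkg = Path(fairseq.__file__).resolve().parent
        # Report what is really being imported, not what pip claims.
        print("version:", getattr(fairseq, "__version__", "unknown"))
        print("location:", pkg)
        print("layout:", detect_fairseq_layout(pkg))
```

If the reported location or layout does not match a v0.10.2 install, a second copy of fairseq (for example under ~/.local) is probably shadowing the intended one.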

Answer 4 · 2023-06-29T13:25:05.000Z

Hi, I have made some changes. Now I am getting this error:

Traceback (most recent call last):
  File "/workspace/data/text_to_table/custom_train.py", line 463, in <module>
    cli_main()
  File "/workspace/data/text_to_table/custom_train.py", line 452, in cli_main
    parser = options.get_training_parser()
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/options.py", line 34, in get_training_parser
    parser = get_parser("Trainer", default_task)
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/options.py", line 210, in get_parser
    utils.import_user_module(usr_args)
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/utils.py", line 446, in import_user_module
    raise FileNotFoundError(module_path)
FileNotFoundError: /workspace/data/text_to_table/src
ls: cannot access '/workspace/data/text_to_table/checkpoints//pt': No such file or directory
ls: cannot access '/workspace/data/text_to_table/checkpoints//checkpointbest_*pt': No such file or directory

usage: average_checkpoints.py [-h] --inputs INPUTS [INPUTS ...] --output FILE
[--num-epoch-checkpoints NUM_EPOCH_CHECKPOINTS | --num-update-checkpoints NUM_UPDATE_CHECKPOINTS]
[--checkpoint-upper-bound CHECKPOINT_UPPER_BOUND]
average_checkpoints.py: error: argument --inputs: expected at least one argument

This is my log in checkpoint:
Traceback (most recent call last):
  File "/workspace/data/text_to_table/custom_train.py", line 463, in <module>
    cli_main()
  File "/workspace/data/text_to_table/custom_train.py", line 452, in cli_main
    parser = options.get_training_parser()
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/options.py", line 34, in get_training_parser
    parser = get_parser("Trainer", default_task)
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/options.py", line 210, in get_parser
    utils.import_user_module(usr_args)
  File "/home/gautam8/.local/lib/python3.6/site-packages/fairseq/utils.py", line 446, in import_user_module
    raise FileNotFoundError(module_path)
FileNotFoundError: /workspace/data/text_to_table/src

The path in the last line is what I am passing as --user-id in train_had.sh.

Can you tell me what inputs refers to here, and what --user-id is in train_had.sh?

Answer 5 · 2023-06-29T13:32:08.000Z

Hi, the script tried to import the code under the --user-dir argument but could not find that path. --user-dir should be set to src/, which is where I implemented my custom task and model.

Have you set this argument properly? Are you running the script from the root directory of this repo? If not, you need to set the argument to an absolute path.
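
A minimal sketch of a pre-flight check for that path is shown below. It assumes the repo root is known to the caller. The helper `resolve_user_dir` is hypothetical and is not part of fairseq or of this repo's scripts.

```python
from pathlib import Path


def resolve_user_dir(user_dir: str, repo_root: str) -> Path:
    """Return an absolute --user-dir path, or raise if it does not exist.

    Relative values such as "src/" are interpreted against repo_root;
    absolute values are used unchanged.
    """
    path = Path(user_dir)
    if not path.is_absolute():
        path = Path(repo_root) / path
    path = path.resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"--user-dir does not exist: {path}")
    return path
```

For example, `resolve_user_dir("src/", "/workspace/data/text_to_table")` would raise the same kind of FileNotFoundError as in the log if the src directory is missing there, but before training is launched.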

I don't see the --user-id argument. Are you referring to --user-dir?
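
On the inputs question: the usage message shows `--inputs INPUTS [INPUTS ...]`, meaning one or more checkpoint files to average. Both `ls` calls in the log found nothing, presumably because training stopped before writing any checkpoint, so the list was empty. A hedged sketch of a guard that would surface this earlier follows. The function name and the `*.pt` pattern are assumptions for illustration and are not taken from train_had.sh.

```python
import glob
import os
from typing import List


def find_checkpoints(ckpt_dir: str, pattern: str = "*.pt") -> List[str]:
    """Return sorted checkpoint paths, failing early if there are none.

    The default pattern is a placeholder; the real script may use a
    different glob.
    """
    matches = sorted(glob.glob(os.path.join(ckpt_dir, pattern)))
    if not matches:
        raise RuntimeError(
            f"no checkpoints matching {pattern!r} under {ckpt_dir!r}; "
            "check the training log before averaging"
        )
    return matches
```

With a guard like this, the averaging error would be replaced by a message that points back to the training log, where the real failure is recorded.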

Answer 6 · 2023-06-29T13:53:47.000Z

Oh yes, sorry, I meant --user-dir.
I am using the DGX2 server, where I need to give absolute paths, which is why I am changing the path everywhere.

Answer 7 · 2023-09-06T14:05:19.000Z

Closing this issue due to inactivity. Please let me know if you have any other problems.