THUDM/ProteinLM

I tried to run it, but after the pretraining task starts, the process kills itself. Can you help?

usccolumbia opened this issue · 1 comment

(proteinlm)xxxx@quant:~/ProteinLM/pretrain$ sh examples/pretrain_tape.sh
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
WARNING: overriding default arguments for tokenizer_type:BertWordPieceLowerCase with tokenizer_type:BertWordPieceCase
------------------------ arguments ------------------------
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_load ....................................... None
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
block_data_path ................................. None
checkpoint_activations .......................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_size .............................. 1
data_path ....................................... ['my-tape_text_sentence']
DDP_impl ........................................ local
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
exit_duration_in_mins ........................... None
exit_interval ................................... None
faiss_use_gpu ................................... False
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_allreduce .................................. False
fp32_residual_connection ........................ False
global_batch_size ............................... 8
hidden_dropout .................................. 0.1
hidden_size ..................................... 768
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
initial_loss_scale .............................. 4294967296
layernorm_epsilon ............................... 1e-12
lazy_mpu_init ................................... None
load ............................................ ./checkopoint
local_rank ...................................... None
log_interval .................................... 100
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0001
lr_decay_iters .................................. 990000
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
max_position_embeddings ......................... 2176
merge_file ...................................... None
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... False
no_load_rng ..................................... False
no_save_optim ................................... False
no_save_rng ..................................... False
num_attention_heads ............................. 12
num_layers ...................................... 12
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
pipeline_model_parallel_size .................... 1
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
report_topk_accuracies .......................... []
reset_attention_mask ............................ False
reset_position_ids .............................. False
save ............................................ ./checkopoint
save_interval ................................... 10000
scaled_masked_softmax_fusion .................... True
scaled_upper_triang_masked_softmax_fusion ....... None
seed ............................................ 1234
seq_length ...................................... 2176
short_seq_prob .................................. 0.1
split ........................................... 32593668,1715454,44311
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
titles_data_path ................................ None
tokenizer_type .................................. BertWordPieceCase
train_iters ..................................... 2000000
train_samples ................................... None
use_checkpoint_lr_scheduler ..................... False
use_cpu_initialization .......................... False
use_one_sent_docs ............................... False
vocab_file ...................................... ./protein_tools/iupac_vocab.txt
weight_decay .................................... 0.01
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2

building BertWordPieceCase tokenizer ...
padded vocab (size: 31) with 97 dummy tokens (new size: 128)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
time to initialize megatron (seconds): 74.673
[after megatron is initialized] datetime: 2022-02-09 00:02:02
building TAPE model ...
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 87417728
learning rate decay style: linear
WARNING: could not find the metadata file ./checkopoint/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
time (ms) | load checkpoint: 10.21
[after model, optimizer, and learning rate scheduler are built] datetime: 2022-02-09 00:02:02
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 16000000
validation: 160080
test: 80
building train, validation, and test datasets for TAPE ...
building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
finished creating indexed dataset in 0.013824 seconds
number of documents: 32593668
dataset split:
train:
document indices in [0, 30924048) total of 30924048 documents
validation:
document indices in [30924048, 32551627) total of 1627579 documents
test:
document indices in [32551627, 32593668) total of 42041 documents
WARNING: could not find index map files, building the indices on rank 0 ...
last epoch number of samples (26365) is larger than 80% of number of samples per epoch (28422), setting separate_last_epoch to False
Killed
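
For reference, a bare "Killed" message with no Python traceback at this point (while rank 0 is building the index map files for the ~32.6M-document corpus) usually means the Linux out-of-memory killer terminated the process rather than the training code raising an error. One generic way to confirm this on the host, assuming the kernel log is readable, is:

dmesg -T | grep -iE 'killed process|out of memory' | tail -n 20
# or, on systemd hosts: journalctl -k | grep -i 'out of memory'

If an OOM entry shows up there, the index-building step most likely exhausted system RAM.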

Hi @usccolumbia,

Sorry for the late reply.

Interesting observation! Did you use the default corpus and bash script we provided?


Best,
Yijia