mit-han-lab/hardware-aware-transformers

Error in step 2.3 (Evolutionary search with latency constraint)

ihish52 opened this issue · 5 comments

Hi,

When following the code step by step, I get an error when running the evolutionary search. The error is:
"_th_admm_out not supported on CPUType for Half"

Do you know what could be causing this and how to fix it? I am currently running this on an i5 CPU. Does the config file need any changes to avoid using the GPU when only the CPU is being tested?

Help with this would be highly appreciated. Thanks.
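
A quick way to check whether this is a general fp16-on-CPU limitation of the installed PyTorch build, independent of the HAT code, is a minimal reproduction (a sketch assuming the failure comes from F.linear on half-precision CPU tensors, as the traceback later in this thread shows):

import torch
import torch.nn.functional as F

# Half-precision tensors on the CPU; older PyTorch CPU builds cannot run
# fp16 matmul/addmm, which is what F.linear uses internally.
x = torch.randn(2, 8, dtype=torch.float16)
w = torch.randn(4, 8, dtype=torch.float16)

try:
    y = F.linear(x, w)
    print("fp16 linear works on this CPU build:", y.shape, y.dtype)
except RuntimeError as e:
    # On affected versions this raises something like
    # "_th_addmm_out not supported on CPUType for Half".
    print("fp16 linear failed on CPU:", e)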

Hi ihish52,

Thanks for your question!
Could you provide more details about your command and which line raised the error?

Thanks for the quick reply. Attached is the config file I am using to perform the evolutionary search on my i5 CPU (barely changed from your example). There are no NVIDIA drivers installed, so I do not think the GPU is involved.

wmt14ende_i5.zip

Below is the output when I run the command for evo_search.py:

python3 evo_search.py --configs=configs/wmt14.en-de/supertransformer/space0.yml --evo-configs=configs/wmt14.en-de/evo_search/wmt14ende_i5.yml

Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformersuper_wmt_en_de', attention_dropout=0.1, beam=5, best_checkpoint_metric='loss', bucket_cap_mb=25, ckpt_path='./latency_dataset/predictors/wmt14ende_cpu_i5.pt', clip_norm=0.0, configs='configs/wmt14.en-de/supertransformer/space0.yml', cpu=False, criterion='label_smoothed_cross_entropy', crossover_size=50, curriculum=0, data='data/binary/wmt16_en_de', dataset_impl=None, ddp_backend='no_c10d', decoder_arbitrary_ende_attn_all_subtransformer=None, decoder_arbitrary_ende_attn_choice=[-1, 1, 2], decoder_attention_heads=8, decoder_embed_choice=[640, 512], decoder_embed_dim=640, decoder_embed_dim_subtransformer=None, decoder_embed_path=None, decoder_ende_attention_heads_all_subtransformer=None, decoder_ende_attention_heads_choice=[8, 4], decoder_ffn_embed_dim=3072, decoder_ffn_embed_dim_all_subtransformer=None, decoder_ffn_embed_dim_choice=[3072, 2048, 1024], decoder_input_dim=640, decoder_layer_num_choice=[6, 5, 4, 3, 2, 1], decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=640, decoder_self_attention_heads_all_subtransformer=None, decoder_self_attention_heads_choice=[8, 4], device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, diverse_beam_groups=-1, diverse_beam_strength=0.5, dropout=0.3, encoder_attention_heads=8, encoder_embed_choice=[640, 512], encoder_embed_dim=640, encoder_embed_dim_subtransformer=None, encoder_embed_path=None, encoder_ffn_embed_dim=3072, encoder_ffn_embed_dim_all_subtransformer=None, encoder_ffn_embed_dim_choice=[3072, 2048, 1024], encoder_layer_num_choice=[6], encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_self_attention_heads_all_subtransformer=None, encoder_self_attention_heads_choice=[8, 4], evo_configs='configs/wmt14.en-de/evo_search/wmt14ende_i5.yml', evo_iter=30, feature_norm=[640.0, 6.0, 2048.0, 6.0, 640.0, 6.0, 2048.0, 6.0, 6.0, 2.0], find_unused_parameters=False, fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, get_attn=False, keep_interval_updates=-1, keep_last_epochs=20, label_smoothing=0.1, lat_norm=700.0, latency_constraint=6000.0, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1, log_format=None, log_interval=1000, lr=[1e-07], lr_period_updates=-1, lr_scheduler='cosine', lr_shrink=1.0, match_source_len=False, max_epoch=0, max_len_a=0, max_len_b=200, max_lr=0.001, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_tokens_valid=4096, max_update=40000, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, min_lr=-1, model_overrides='{}', mutation_prob=0.3, mutation_size=50, nbest=1, no_beamable_mm=False, no_early_stop=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_repeat_ngram_size=0, no_save=False, no_save_optimizer_state=False, no_token_positional_embeddings=False, num_workers=10, optimizer='adam', optimizer_overrides='{}', parent_size=25, path=None, pdb=False, population_size=125, prefix_size=0, print_alignment=False, profile_latency=False, qkv_dim=512, quiet=False, raw_text=False, remove_bpe=None, 
replace_unk=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='./downloaded_models/HAT_wmt14ende_super_space0.pt', results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, save_dir='checkpoints/wmt14.en-de/supertransformer/space0', save_interval=10, save_interval_updates=0, score_reference=False, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, t_mult=1, target_lang=None, task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='checkpoints/wmt14.en-de/supertransformer/space0/tensorboard', threshold_loss_scale=None, train_subset='train', unkpen=0, unnormalized=False, update_freq=[16], upsample_primary=1, use_bmuf=False, user_dir=None, valid_cnt_max=1000000000.0, valid_subset='valid', validate_interval=10, vocab_original_scaling=False, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.0, write_config_path='configs/wmt14.en-de/subtransformer/wmt14ende_i5.yml')
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
| loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.en
| loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.de
| data/binary/wmt16_en_de valid en-de 3000 examples
| Fallback to xavier initializer
TransformerSuperModel(
(encoder): TransformerEncoder(
(embed_tokens): EmbeddingSuper(32768, 640, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(embed_tokens): EmbeddingSuper(32768, 640, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(self_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttentionSuper num_heads:8 qkv_dim:512
(out_proj): LinearSuper(in_features=512, out_features=640, bias=True)
)
(encoder_attn_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
(fc1): LinearSuper(in_features=640, out_features=3072, bias=True)
(fc2): LinearSuper(in_features=3072, out_features=640, bias=True)
(final_layer_norm): LayerNormSuper((640,), eps=1e-05, elementwise_affine=True)
)
)
)
)
| loaded checkpoint ./downloaded_models/HAT_wmt14ende_super_space0.pt (epoch 136 @ 0 updates)
| loading train data for epoch 136
| loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.en
| loaded 3000 examples from: data/binary/wmt16_en_de/valid.en-de.de
| data/binary/wmt16_en_de valid en-de 3000 examples
| Start Iteration 0:
Traceback (most recent call last):
File "evo_search.py", line 106, in
cli_main()
File "evo_search.py", line 102, in cli_main
main(args)
File "evo_search.py", line 51, in main
best_config = evolver.run_evo_search()
File "/home/hishan/hardware-aware-transformers/fairseq/evolution.py", line 217, in run_evo_search
popu_scores = self.get_scores(popu)
File "/home/hishan/hardware-aware-transformers/fairseq/evolution.py", line 281, in get_scores
scores = validate_all(self.args, self.trainer, self.task, self.epoch_iter, configs)
File "/home/hishan/hardware-aware-transformers/fairseq/evolution.py", line 401, in validate_all
trainer.valid_step(sample)
File "/home/hishan/hardware-aware-transformers/fairseq/trainer.py", line 451, in valid_step
raise e
File "/home/hishan/hardware-aware-transformers/fairseq/trainer.py", line 438, in valid_step
_loss, sample_size, logging_output = self.task.valid_step(
File "/home/hishan/hardware-aware-transformers/fairseq/tasks/fairseq_task.py", line 241, in valid_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hishan/hardware-aware-transformers/fairseq/criterions/label_smoothed_cross_entropy.py", line 56, in forward
net_output = model(**sample['net_input'])
File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hishan/hardware-aware-transformers/fairseq/models/fairseq_model.py", line 222, in forward
encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hishan/hardware-aware-transformers/fairseq/models/transformer_super.py", line 401, in forward
x = layer(x, encoder_padding_mask)
File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hishan/hardware-aware-transformers/fairseq/models/transformer_super.py", line 900, in forward
x, _ = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask)
File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hishan/hardware-aware-transformers/fairseq/modules/multihead_attention_super.py", line 182, in forward
q, k, v = self.in_proj_qkv(query)
File "/home/hishan/hardware-aware-transformers/fairseq/modules/multihead_attention_super.py", line 314, in in_proj_qkv
return self._in_proj(query, sample_dim=self.sample_q_embed_dim).chunk(3, dim=-1)
File "/home/hishan/hardware-aware-transformers/fairseq/modules/multihead_attention_super.py", line 351, in _in_proj
return F.linear(input, weight, bias)
File "/home/hishan/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1676, in linear
output = input.matmul(weight.t())
RuntimeError: _th_addmm_out not supported on CPUType for Half

I'm still not sure what caused this. I created a new environment and reinstalled all dependencies with the exact versions listed in the requirements, and the search now runs. Closing this issue. Thanks for the response.

Hi ihish52,

Sorry for the late reply; I have been very busy over the past several weeks.
The error occurs because some float16 operations are not supported by PyTorch on the CPU, so I fixed it by falling back to fp32 when performing the evolutionary search on the CPU. (commit)
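
The fix amounts to keeping the model in float32 whenever no GPU is available. A minimal sketch of that kind of guard (illustrative only, not the exact commit; the helper name is made up):

import torch
import torch.nn as nn

def prepare_for_device(model: nn.Module, want_fp16: bool) -> nn.Module:
    # Hypothetical helper: keep the model in fp16 only when CUDA is available;
    # fall back to fp32 on the CPU, where older PyTorch builds cannot run
    # half-precision matmul/addmm.
    if want_fp16 and torch.cuda.is_available():
        return model.cuda().half()
    return model.float()

# Usage with a stand-in layer; on a CPU-only machine the output stays fp32.
layer = prepare_for_device(nn.Linear(640, 3072), want_fp16=True)
param = next(layer.parameters())
x = torch.randn(2, 640, dtype=param.dtype, device=param.device)
print(layer(x).dtype)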

Thanks for your contribution!

Best,
Hanrui

Hi Hishan,

I will close the issue for now. Feel free to reopen if you have any further questions!

Best,
Hanrui