EleutherAI/gpt-neox

LLaMA MLP projection layers mismatch with the HF config during conversion

Vmjkom opened this issue · 3 comments

Describe the bug
When I try to convert a NeoX-trained LLaMA model (config below) with convert_neox_to_hf.py, I get the error shown in the screenshot.
So in my view the dimensions of the MLP layers are not being configured correctly during training. I had not come across this issue before #1212.

To Reproduce
Train a model with the provided config and try to convert it to Hugging Face format.

Proposed solution
I would look at #1276 and #1212 for possible changes to the LLaMA/MLP code paths that could have led to the aforementioned problem.
One could also revert to the LLAMAParallelMLP class and the mlp_type: "llama" parameter combination from before.
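
As a quick way to confirm where the mismatch comes from, something like the sketch below can dump the MLP weight shapes stored in the NeoX checkpoint so they can be compared with what the Hugging Face LLaMA config expects. This is a minimal sketch with assumptions: it presumes a non-pipeline-parallel checkpoint saved as mp_rank_00_model_states.pt with parameters nested under a "module" key, and the path and key layout may differ depending on your DeepSpeed/NeoX version.

```python
# Minimal sketch (assumptions noted above): print the shapes of all MLP weights
# in a NeoX/DeepSpeed checkpoint so they can be compared against the HF LLaMA
# expectation of gate_proj/up_proj = [intermediate_size, hidden_size] and
# down_proj = [hidden_size, intermediate_size].
import torch

# Hypothetical checkpoint path; adjust to your save directory and step.
CKPT = "/path/to/checkpoints/neox/debug/global_step10/mp_rank_00_model_states.pt"

state = torch.load(CKPT, map_location="cpu")
state_dict = state.get("module", state)  # DeepSpeed usually nests params under "module"

for name, param in state_dict.items():
    if "mlp" in name and hasattr(param, "shape"):
        print(f"{name}: {tuple(param.shape)}")
```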

Screenshots
(screenshot of the MLP layer size-mismatch error raised by convert_neox_to_hf.py)

Environment (please complete the following information):

  • GPUs: 2x8 MI250X (AMD)
  • Libraries:
    deepspeed @ git+https://github.com/EleutherAI/DeeperSpeed.git@02e2ebf7dee6aaab3d89094ed470a4609763c742
    flash-attn @ file:///opt/wheels/flash_attn-2.0.4-cp310-cp310-linux_x86_64.whl#sha256=0dc568c7b3516cc3f45f33858fe5ef048e5b7a82ba56c89189d5f6a97f4574f2
    ftfy==6.2.3
    lion-pytorch==0.1.4
    lm-dataformat @ git+https://github.com/EleutherAI/lm_dataformat.git@4eec05349977071bf67fc072290b95e31c8dd836
    lm_eval==0.4.1
    mpi4py @ file:///opt/wheels/mpi4py-3.1.4-cp310-cp310-linux_x86_64.whl#sha256=6e012d8c61c0a0d8d6e93b4d98ba6946bb5a5c3d8280d1e0db93862ec19025c2
    numpy==1.26.3
    pybind11==2.13.6
    pytorch-triton-rocm==2.2.0
    regex==2024.5.15
    sentencepiece==0.2.0
    six==1.16.0
    tiktoken==0.7.0
    tokenizers==0.15.2
    torch==2.2.2+rocm5.6
    torchaudio==2.2.2+rocm5.6
    torchdata==0.7.1
    torchtext==0.17.2+cpu
    torchvision==0.17.2+rocm5.6
    transformers==4.38.0
    Python 3.10.13
  • Configs:
{
  # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
  # across the node boundaries )
  "pipe_parallel_size": 0,
  "model_parallel_size": 1,

  "seed": 42,

  #Tokenizer
  "make_vocab_size_divisible_by": 1,
  "tokenizer_type": "GPT2BPETokenizer",
  "data_path": "/scratch/project_462000353/jburdge/data/fineweb-edu-100B/tokenized/gpt2_text_document",
  "vocab_file": "/scratch/project_462000353/tokenizers/gpt2/vocab.json",
  "merge_file": "/scratch/project_462000353/tokenizers/gpt2/merges.txt",

  # model settings
  "num_layers": 24,
  "hidden_size": 2048,
  "num_attention_heads": 32,
  "seq_length": 2048,
  "max_position_embeddings": 2048,
  "norm": "rmsnorm",
  "rms_norm_epsilon": 1.0e-05,
  "pos_emb": "rotary",
  "intermediate_size": 8192,
  "no_weight_tying": true,
  "gpt_j_residual": false,
  "output_layer_parallelism": "column",
  "num_kv_heads": 32,

  "scaled_upper_triang_masked_softmax_fusion": false,
  "bias_gelu_fusion": false,
  "use_bias_in_norms": false,
  "use_bias_in_attn_linear": false,
  "activation": "swiglu",
  "use_flashattn_swiglu": true,
  "mlp_multiple_of": 1,
  "use_bias_in_mlp": false,

  #flash_attention - value = num_layers
  "attention_config": [[["flash"], 24]],

  # init methods
  "init_method": "small_init",
  "init_method_std": 0.02,
  "output_layer_init_method": "wang_init",

  # optimizer settings
  "optimizer":
    {
      "type": "Adam",
      "params": { "lr": 3.0e-4, "betas": [0.9, 0.95], "eps": 1.0e-8 },
    },
  "min_lr": 3.0e-5,

  # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
  "zero_optimization":
    {
      "stage": 0,
      "allgather_partitions": True,
      "allgather_bucket_size": 50000000,
      "overlap_comm": True,
      "reduce_scatter": false,
      "reduce_bucket_size": 50000000,
      "contiguous_gradients": True,
    },

  # batch / data settings
  "train_micro_batch_size_per_gpu": 32,
  "gradient_accumulation_steps": 2,
  "data_impl": "mmap",

  # activation checkpointing
  "checkpoint_activations": true,
  "checkpoint_num_layers": 1,
  "partition_activations": false,
  "synchronize_each_layer": false,

  # regularization
  "gradient_clipping": 1.0,
  "weight_decay": 0.1,
  "hidden_dropout": 0.0,
  "attention_dropout": 0.0,

  # precision settings
  "precision": "bfloat16",
  "fp32_allreduce": true,

  # misc. training settings
  "train_iters": 10,
  "lr_decay_iters": 10,
  "distributed_backend": "nccl",
  "lr_decay_style": "cosine",
  "warmup": 0.01,

  #Evaluation
  "eval_interval": 10,
  "eval_iters": 5,

  #Dataloader workers
  "num_workers": 2,

  #Checkpoints
  "checkpoint_factor": 10,
  "keep_last_n_checkpoints": 1,
  "save": "/scratch/project_462000353/villekom/checkpoints/neox/debug/",
  #"load": "/scratch/project_462000353/villekom/checkpoints/neox/debug/",

  # logging
  "log_interval": 1,
  "steps_per_print": 1,
  "tensorboard_dir": "logs/tb/",
  "log_grad_pct_zeros": True,
  "log_grad_norm": True,
  "log_gradient_noise_scale": False, #Gradient Noise Scale logging does not work with zero stage 2+, as the gradients are distributed across ranks.

  #Deepspeed misc
  "wall_clock_breakdown": true,
  "tensorboard": { "enabled": false, "output_path": "logs/tb/" },
  "comms_logger":
    { "enabled": false, "verbose": false, "prof_all": true, "debug": False },
}

Additional context

I also encountered this issue with the llama-style MLP, and I had to set intermediate_size to three times the intended value to work around it.
I made a pull request (#1309) which fixes the llama configurations in the 'example' directories. I hope this helps.
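
For reference, here is the arithmetic that workaround implies, as a minimal sketch. The exact scaling happens inside gpt-neox's MLP construction and may differ between versions; the rescale-by-2/3-then-split-in-half factor below is an assumption that is consistent with the reported 3x fix, not a quote of the implementation.

```python
# Sketch of the intermediate_size arithmetic implied by the "3x" workaround above.
# ASSUMPTION: with a gated activation (swiglu), the configured feed-forward width
# is rescaled by 2/3 (to keep the parameter count comparable to a non-gated MLP)
# and then split between the gate and up projections, so each projection ends up
# roughly intermediate_size / 3 wide.
hidden_size = 2048
configured = 8192                      # intermediate_size from the config above

ff_dim = int(configured * 2 / 3)       # ~5461 after the gated-activation rescale
per_projection = ff_dim // 2           # ~2730 per gate/up projection

# Hugging Face's LlamaMLP expects gate_proj/up_proj of shape
# [intermediate_size, hidden_size] = [8192, 2048], so ~2730 triggers the mismatch.
print(per_projection)                  # 2730, not 8192

# Workaround from this thread: configure three times the intended value.
intended = 8192
configured_fixed = 3 * intended        # 24576
print(int(configured_fixed * 2 / 3) // 2 == intended)   # True
```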

This should be resolved now that #1309 is merged. Reopen if this isn't the case for you!