llama2 70B model shows different loss-descent trends under different PP settings
Opened this issue · 1 comment
ZLkanyo009 commented
When training llama2 70B (with the number of layers reduced), the loss follows a different downward trend at PP=1 than at PP=4. The logs and loss-curve plots are in the uploads above; the launch script is as follows:
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=192.167.5.2
MASTER_PORT=29501
NUM_NODES=4
NODE_RANK=0
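# NOTE: with NUM_NODES=4 this script must be launched once per node, each copy
# with its own NODE_RANK (0 through 3); torchrun requires distinct node ranks.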
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
CHECKPOINT_PATH='/data/zhangling21/ckpts/'
TENSORBOARD_LOGS_PATH='/data/zhangling21/tensorboard_logs/'
TOKENIZER_PATH='/data/zhangling21/llama_00_text_document/tokenizer/tokenizer.model'
DATA_PATH='/data/zhangling21/llama_00_text_document/llama_00_text_document'
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NUM_NODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
# --tokenizer-type LLaMASentencePieceTokenizer \
# --rmsnorm-epsilon 1e-5
LLAMA_MODEL_ARGS=(
--num-layers 8
--hidden-size 8192
--ffn-hidden-size 28672
--num-attention-heads 64
--seq-length 4096
--max-position-embeddings 4096
--group-query-attention
--num-query-groups 8
--tokenizer-type Llama2Tokenizer
--tokenizer-model $TOKENIZER_PATH
--swiglu
--normalization RMSNorm
--use-rotary-position-embeddings
--no-position-embedding
--disable-bias-linear
)
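# Editor's sketch (not part of the original script): rough parameter count for
# this reduced 8-layer config. Assumes a 32000-token LLaMA-2 vocabulary, which
# the script does not specify.
H=8192; FFN=28672; KV=$((8192 / 64 * 8 * 2))      # K+V projection width for 8 query groups
PER_LAYER=$((H*H + H*KV + H*H + 3*H*FFN))         # Q, K+V, O projections + SwiGLU MLP
echo "approx params: $(( (8*PER_LAYER + 2*32000*H) / 1000000 ))M"  # ~7.4B with untied embeddings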
# --optimizer adam
# --adam-eps 1e-05
# --no-contiguous-buffers-in-local-ddp
# --recompute-method uniform
# --no-async-tensor-model-parallel-allreduce
# --embedding-dropout 0
# --multi-query-attention
# --multi-query-group-num 8
# --ffn-dim-multiplier 1.3
# --recompute-granularity full
# --distribute-saved-activations
# --recompute-num-layers 1
# --memory-saving
# --fp16
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 44
--train-samples 24414
--weight-decay 1e-2
--optimizer adam
--clip-grad 1.0
--lr 0.00015
--lr-decay-style cosine
--min-lr 1.0e-5
--lr-warmup-fraction .01
--adam-beta1 0.9
--adam-beta2 0.95
--attention-dropout 0.0
--hidden-dropout 0.0
--untie-embeddings-and-output-weights
--multiple-of 4096
--no-gradient-accumulation-fusion
--recompute-granularity 'full'
--recompute-num-layers 1
--recompute-method 'uniform'
--no-async-tensor-model-parallel-allreduce
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 4
)
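# Editor's sanity check (not part of the original script): with WORLD_SIZE=32,
# TP=8 x PP=4 leaves a data-parallel size of 1, so global-batch-size 44 is
# reached via 44 gradient-accumulation steps; at PP=1 the same world size
# gives DP=4 and 11 accumulation steps, so samples are grouped differently
# per optimizer step even though the global batch is the same.
TP=8; PP=4
DP=$((WORLD_SIZE / (TP * PP)))
echo "DP=$DP, grad-accum steps=$((44 / DP))"      # micro-batch-size is 1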
DATA_ARGS=(
--data-path $DATA_PATH
--split 1
)
EVAL_AND_LOGGING_ARGS=(
--log-interval 1
--init-method-std 0.02
--seed 1234
--eval-iters 0
--use-cpu-initialization
)
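# NOTE (editor): the fixed seed plus --use-cpu-initialization are presumably
# meant to give both parallel layouts comparable initial weights; checking
# that the iteration-1 loss matches between the PP=1 and PP=4 logs would
# confirm the divergence happens during training rather than at init.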
#--load "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
#--no-load-rng
#--save "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
#--save-interval 1
cmd="torchrun ${DISTRIBUTED_ARGS[@]} pretrain_llama.py \
${LLAMA_MODEL_ARGS[@]} \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${DATA_ARGS[@]} \
${EVAL_AND_LOGGING_ARGS[@]}"
echo "$cmd"
eval "$cmd"
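To line the two runs up iteration by iteration, a minimal comparison sketch (hypothetical log file names; assumes the Megatron-style "lm loss:" entries these logs normally contain):
grep -oP 'lm loss: \K[0-9.E+-]+' pp1.log > pp1_loss.txt
grep -oP 'lm loss: \K[0-9.E+-]+' pp4.log > pp4_loss.txt
paste pp1_loss.txt pp4_loss.txt | head    # per-iteration loss, side by side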
ZLkanyo009 commented
@zhaoyinglia Hi, could you please take a look at this issue? @aoyulong has been quite busy lately.