nebuly-ai/optimate

[Chatllama] train actor model with llama7B, the loss is nan

balcklive opened this issue · 0 comments

I manually split the model checkpoint into 8 shards and trained the LLaMA model on 8 V100 GPUs, but strangely the loss is nan. I trained successfully with the same data on gpt2-xl, so I don't think it is a data problem.
Can anybody figure out why?
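To narrow down where the nan first appears, here is a minimal debugging sketch I could try before the model is wrapped by DeepSpeed (just an illustration, not part of chatllama; `actor_model` stands for the unwrapped ActorModel instance):

```python
import torch

def register_nan_hooks(model: torch.nn.Module) -> None:
    """Attach forward hooks that print the first submodule emitting nan/inf."""
    def make_hook(name: str):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    print(f"non-finite output first seen in: {name} ({module.__class__.__name__})")
                    break
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# hypothetical usage: register_nan_hooks(actor_model) before the training loop,
# then run a single batch and see which layer is the first to go non-finite.
```

If the very first embedding or attention layer already produces non-finite values, that would point at the weights (e.g. a bad manual split) rather than the data.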

ubuntu@ip-172-31-10-190:~/ubuntu$ torchrun --standalone --nnodes=1 --nproc-per-node=8 artifacts/main.py artifacts/config/config.yaml --type ACTOR
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.15) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Current device used :cuda
Start cleaning the dataset for Actor
Current device used :cuda
Current device used :cuda
Start cleaning the dataset for Actor
Start cleaning the dataset for Actor
Current device used :cuda
Current device used :cuda
Start cleaning the dataset for Actor
Current device used :cuda
Current device used :cuda
Current device used :cuda
Start cleaning the dataset for Actor
Start cleaning the dataset for Actor
Start cleaning the dataset for Actor
Start cleaning the dataset for Actor
Dataset is already clean
Dataset is already clean
Dataset is already clean
local_rank: 4 world_size: 8
Dataset is already clean
Dataset is already clean
Dataset is already clean
Dataset is already clean
local_rank: 0 world_size: 8
local_rank: 6 world_size: 8
local_rank: 5 world_size: 8
local_rank: 1 world_size: 8
Dataset is already clean
local_rank: 2 world_size: 8
local_rank: 7 world_size: 8
local_rank: 3 world_size: 8

initializing model parallel with size 8
initializing ddp with size 1
initializing pipeline with size 1
Loading
Loading
Loading
Loading
Loading
Loading
Loading
Loading
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
No previous model found at /home/ubuntu/ubuntu/pyllama_data1/7B/actor for model llama-7B.pt
[2023-03-31 03:50:04,515] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,533] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,600] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,614] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,655] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,685] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,686] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:04,750] [INFO] [logging.py:77:log_dist] [Rank -1] DeepSpeed info: version=0.8.2, git-hash=unknown, git-branch=unknown
[2023-03-31 03:50:06,138] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-03-31 03:50:06,138] [INFO] [logging.py:77:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-03-31 03:50:06,138] [INFO] [logging.py:77:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-03-31 03:50:06,152] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-03-31 03:50:06,152] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-31 03:50:06,152] [INFO] [logging.py:77:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2023-03-31 03:50:06,306] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-03-31 03:50:06,307] [INFO] [utils.py:830:see_memory_usage] MA 14.55 GB Max_MA 14.55 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:06,307] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 21.76 GB, percent = 2.9%
[2023-03-31 03:50:06,308] [INFO] [stage3.py:113:__init__] Reduce bucket size 100
[2023-03-31 03:50:06,309] [INFO] [stage3.py:114:__init__] Prefetch bucket size 0
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.25902438163757324 seconds
Loading extension module utils...
Time to load utils op: 0.10285019874572754 seconds
Loading extension module utils...
Time to load utils op: 0.20204687118530273 seconds
Loading extension module utils...
Time to load utils op: 0.20221972465515137 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.30254244804382324 seconds
Time to load utils op: 0.3023359775543213 seconds
Loading extension module utils...
Time to load utils op: 0.3025219440460205 seconds
Loading extension module utils...
Time to load utils op: 0.30247020721435547 seconds
[2023-03-31 03:50:07,142] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-03-31 03:50:07,143] [INFO] [utils.py:830:see_memory_usage] MA 14.55 GB Max_MA 14.55 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:07,143] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 22.86 GB, percent = 3.1%
Parameter Offload: Total persistent parameters: 0 in 0 params
[2023-03-31 03:50:09,626] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-03-31 03:50:09,627] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 14.55 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:09,627] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 38.78 GB, percent = 5.2%
[2023-03-31 03:50:09,708] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-03-31 03:50:09,708] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:09,708] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 38.78 GB, percent = 5.2%
[2023-03-31 03:50:12,637] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 9
[2023-03-31 03:50:12,638] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:12,638] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 57.15 GB, percent = 7.6%
[2023-03-31 03:50:12,756] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-03-31 03:50:12,757] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:12,757] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 58.49 GB, percent = 7.8%
[2023-03-31 03:50:15,870] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-03-31 03:50:15,870] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:15,871] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 80.31 GB, percent = 10.7%
[2023-03-31 03:50:15,985] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-03-31 03:50:15,986] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:15,986] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 81.74 GB, percent = 10.9%
[2023-03-31 03:50:36,904] [INFO] [utils.py:829:see_memory_usage] After initializing optimizer states
[2023-03-31 03:50:36,905] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.0 GB CA 14.57 GB Max_CA 15 GB
[2023-03-31 03:50:36,905] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 167.03 GB, percent = 22.3%
[2023-03-31 03:50:38,351] [INFO] [stage3.py:376:_setup_for_real_optimizer] optimizer state initialized
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0007975101470947266 seconds
Training with DeepSpeed
Start Actor Model Pretraining
Looking for checkpoints...
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Time to load utils op: 0.0007760524749755859 seconds
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Training with DeepSpeed
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Start Actor Model Pretraining
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Looking for checkpoints...
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
Time to load utils op: 0.0008547306060791016 seconds

Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Training with DeepSpeed
Start Actor Model Pretraining
Time to load utils op: 0.0008461475372314453 seconds
No modifications detected for re-loaded extension module utils, skipping build step...
Looking for checkpoints...
Loading extension module utils...
Time to load utils op: 0.0008032321929931641 seconds
Training with DeepSpeed
Time to load utils op: 0.0009899139404296875 seconds
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
Training with DeepSpeed
Start Actor Model Pretraining
Looking for checkpoints...
Start Actor Model Pretraining
Training with DeepSpeed
Time to load utils op: 0.0009548664093017578 seconds
Looking for checkpoints...
Start Actor Model Pretraining
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
Looking for checkpoints...
Training with DeepSpeed
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
Start Actor Model Pretraining
Looking for checkpoints...
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
[2023-03-31 03:50:41,801] [INFO] [utils.py:829:see_memory_usage] After initializing ZeRO optimizer
[2023-03-31 03:50:41,802] [INFO] [utils.py:830:see_memory_usage] MA 2.0 GB Max_MA 2.49 GB CA 14.82 GB Max_CA 15 GB
[2023-03-31 03:50:41,802] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory: used = 179.5 GB, percent = 24.0%
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.CosineAnnealingWarmRestarts object at 0x7f812de03340>
[2023-03-31 03:50:41,802] [INFO] [logging.py:77:log_dist] [Rank 0] step=0, skipped=0, lr=[9e-06], mom=[(0.9, 0.999)]
[2023-03-31 03:50:41,803] [INFO] [config.py:1010:print] DeepSpeedEngine configuration:
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] amp_enabled .................. False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] amp_params ................... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] bfloat16_enabled ............. False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] checkpoint_parallel_write_pipeline False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] checkpoint_tag_validation_enabled True
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] checkpoint_tag_validation_fail False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f812de038e0>
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] communication_data_type ...... None
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] curriculum_enabled_legacy .... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] curriculum_params_legacy ..... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] data_efficiency_enabled ...... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] dataloader_drop_last ......... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] disable_allgather ............ False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] dump_state ................... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_enabled ........... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_gas_boundary_resolution 1
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_layer_num ......... 0
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_max_iter .......... 100
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_stability ......... 1e-06
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_tol ............... 0.01
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] eigenvalue_verbose ........... False
[2023-03-31 03:50:41,804] [INFO] [config.py:1014:print] elasticity_enabled ........... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] fp16_auto_cast ............... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] fp16_enabled ................. True
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] fp16_master_weights_and_gradients False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] global_rank .................. 0
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] grad_accum_dtype ............. None
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] gradient_accumulation_steps .. 1
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] gradient_clipping ............ 0.0
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] gradient_predivide_factor .... 1.0
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] initial_dynamic_scale ........ 4294967296
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] load_universal_checkpoint .... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] loss_scale ................... 0
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] memory_breakdown ............. False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] optimizer_legacy_fusion ...... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] optimizer_name ............... adam
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] optimizer_params ............. {'lr': 0.00015}
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] pld_enabled .................. False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] pld_params ................... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] prescale_gradients ........... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] scheduler_name ............... None
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] scheduler_params ............. None
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] sparse_attention ............. None
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] sparse_gradients_enabled ..... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] steps_per_print .............. 10
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] train_batch_size ............. 8
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] train_micro_batch_size_per_gpu 1
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] use_node_local_storage ....... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] wall_clock_breakdown ......... False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] world_size ................... 8
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_allow_untested_optimizer False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=0 param_persistence_threshold=100 model_persistence_threshold=sys.maxsize max_live_parameters=0 max_reuse_distance=0 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_enabled ................. True
[2023-03-31 03:50:41,805] [INFO] [config.py:1014:print] zero_optimization_stage ...... 3
[2023-03-31 03:50:41,806] [INFO] [config.py:999:print_user_config] json = {
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00015
}
},
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"contiguous_gradients": true,
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0,
"stage3_prefetch_bucket_size": 0,
"stage3_param_persistence_threshold": 100,
"reduce_bucket_size": 100,
"sub_group_size": 1.000000e+08,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"stage3_gather_16bit_weights_on_model_save": true
},
"fp16": {
"enabled": true,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 32,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"train_batch_size": 8,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false
}
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004968643188476562 seconds
Training with DeepSpeed
Start Actor Model Pretraining
Looking for checkpoints...
No previous checkpoint found at /home/ubuntu/ubuntu/pyllama_data1/7B/checkpoints/actor for llama-7B.pt
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:3015: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
warnings.warn(
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
[2023-03-31 03:50:48,116] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296
Epoch: 1/1, Iteration: 1/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
[2023-03-31 03:50:51,973] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 2/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
[2023-03-31 03:50:55,374] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 3/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
[2023-03-31 03:50:58,870] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
Epoch: 1/1, Iteration: 4/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
[2023-03-31 03:51:02,331] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
Epoch: 1/1, Iteration: 5/160782, Training Loss: nan
[2023-03-31 03:51:05,827] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 6/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
[2023-03-31 03:51:09,498] [INFO] [stage3.py:1843:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Epoch: 1/1, Iteration: 7/160782, Training Loss: nan
Traceback (most recent call last):
File "artifacts/main.py", line 61, in
actor_trainer.train()
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 340, in train
est_output = self.model_engine(
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1832, in forward
loss = self.module(*inputs, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "<@beartype(chatllama.rlhf.actor.ActorModel.forward) at 0x7f8153aad550>", line 51, in forward
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 43, in forward
model_output = self.model.forward(
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/llama_model.py", line 512, in forward
logits = self._forward(tokens, attention_mask)
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/llama_model.py", line 552, in _forward
h, _, _ = layer(h, kv_mask, freqs_cis)
File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/llama_model.py", line 438, in forward
attn, cache_k, cache_v = self.attention.forward(
File "/home/ubuntu/.local/lib/python3.8/site-packages/chatllama/llama_model.py", line 330, in forward
scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB (GPU 0; 31.75 GiB total capacity; 28.66 GiB already allocated; 41.94 MiB free; 29.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
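
The OOM message itself suggests tuning max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF when reserved memory is much larger than allocated memory. A minimal way to try that (the value 128 below is only an assumption to experiment with, not a chatllama setting) is to set the variable before any CUDA memory is allocated, e.g. at the very top of artifacts/main.py, or export it in the shell before running torchrun:

```python
import os

# Must be set before the first CUDA allocation; the shell equivalent is
# PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 torchrun ...
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the caching allocator picks it up
```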