
Full-parameter pretraining of GLM4-9B-base on 8× A800 with bf16: loss explodes, then suddenly vanishes #4597

Open
1 task done
lclcjj opened this issue Jun 28, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

lclcjj commented Jun 28, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

8× A800, using the library versions listed in requirements.txt

Reproduction

deepspeed --master_port=9903 --num_gpus 8 src/train.py \
    --deepspeed ds_zero2_no_offload.json \
    --stage pt \
    --do_train \
    --model_name_or_path /gemini/data-2/glm-4-9b/ \
    --dataset mixed \
    --finetuning_type full \
    --output_dir /model_output/lc_output/xunzi_glm4_9b \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 1000 \
    --learning_rate 3e-5 \
    --num_train_epochs 1.0 \
    --ddp_timeout 300000000 \
    --plot_loss \
    --overwrite_output_dir \
    --overwrite_cache \
    --cache_dir /model_output/lc_lm_data/xunzi_glm4_dataset_cache/ \
    --cutoff_len 2048 \
    --preprocessing_num_workers 64 \
    --bf16 > logguwen_glm4_9b_full.log 2>&1 & echo $! > run.pid
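
The ds_zero2_no_offload.json itself is not attached here. For context, a typical ZeRO-2 configuration without CPU offload, sketched from DeepSpeed's standard options with "auto" values resolved by the HF Trainer, looks roughly like the following (the actual file used in this run may differ):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "allgather_partitions": true
  }
}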

[INFO|trainer.py:2078] 2024-06-27 19:27:04,713 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-06-27 19:27:04,713 >> Num examples = 1,623,370
[INFO|trainer.py:2080] 2024-06-27 19:27:04,713 >> Num Epochs = 1
[INFO|trainer.py:2081] 2024-06-27 19:27:04,713 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2084] 2024-06-27 19:27:04,713 >> Total train batch size (w. parallel, distributed & accumulation) = 1,024
[INFO|trainer.py:2085] 2024-06-27 19:27:04,713 >> Gradient Accumulation steps = 32
[INFO|trainer.py:2086] 2024-06-27 19:27:04,713 >> Total optimization steps = 1,585
[INFO|trainer.py:2087] 2024-06-27 19:27:04,714 >> Number of trainable parameters = 9,399,951,360
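
Sanity check on the batch math: 4 per device × 8 GPUs × 32 accumulation steps = 1,024 total batch size, and 1,623,370 examples / 1,024 ≈ 1,585, matching the reported number of optimization steps.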

0%| | 0/1585 [00:00<?, ?it/s]
0%| | 1/1585 [03:23<89:39:39, 203.77s/it]
0%| | 2/1585 [06:40<87:46:42, 199.62s/it]
0%| | 3/1585 [09:57<87:06:39, 198.23s/it]
0%| | 4/1585 [13:13<86:40:02, 197.34s/it]
0%| | 5/1585 [16:29<86:29:17, 197.06s/it]

{'loss': 1216724077012582.5, 'grad_norm': nan, 'learning_rate': 2.9999263387762184e-05, 'epoch': 0.0}

0%|▏ | 6/1585 [19:46<86:22:26, 196.93s/it]
0%|▏ | 7/1585 [23:03<86:18:03, 196.88s/it]
1%|▏ | 8/1585 [26:19<86:12:04, 196.78s/it]
1%|▏ | 9/1585 [29:36<86:09:50, 196.82s/it]
1%|▏ | 10/1585 [32:53<86:04:39, 196.75s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.999705362339508e-05, 'epoch': 0.01}

1%|▏ | 11/1585 [36:09<85:54:36, 196.49s/it]
1%|▎ | 12/1585 [39:24<85:46:03, 196.29s/it]
1%|▎ | 13/1585 [42:41<85:44:46, 196.37s/it]
1%|▎ | 14/1585 [45:57<85:39:01, 196.27s/it]
1%|▎ | 15/1585 [49:13<85:33:18, 196.18s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9993370923930618e-05, 'epoch': 0.01}

1%|▎ | 16/1585 [52:29<85:30:31, 196.20s/it]
1%|▍ | 17/1585 [55:47<85:36:36, 196.55s/it]
1%|▍ | 18/1585 [59:03<85:35:07, 196.62s/it]
1%|▍ | 19/1585 [1:02:21<85:40:27, 196.95s/it]
1%|▍ | 20/1585 [1:05:38<85:34:45, 196.86s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9988215651065e-05, 'epoch': 0.01}

1%|▍ | 21/1585 [1:08:54<85:30:34, 196.83s/it]
1%|▍ | 22/1585 [1:12:11<85:24:27, 196.72s/it]
1%|▍ | 23/1585 [1:15:27<85:19:02, 196.63s/it]
2%|▌ | 24/1585 [1:18:44<85:14:31, 196.59s/it]
2%|▌ | 25/1585 [1:22:00<85:11:14, 196.59s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9981588311123172e-05, 'epoch': 0.02}

2%|▌ | 26/1585 [1:25:18<85:13:16, 196.79s/it]
2%|▌ | 27/1585 [1:28:35<85:11:20, 196.84s/it]
2%|▌ | 28/1585 [1:31:52<85:09:44, 196.91s/it]
2%|▌ | 29/1585 [1:35:08<85:02:41, 196.76s/it]
2%|▋ | 30/1585 [1:38:25<84:59:39, 196.77s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9973489555009086e-05, 'epoch': 0.02}

2%|▋ | 31/1585 [1:41:42<84:56:43, 196.78s/it]
2%|▋ | 32/1585 [1:44:59<84:56:06, 196.89s/it]
2%|▋ | 33/1585 [1:48:16<84:57:25, 197.07s/it]
2%|▋ | 34/1585 [1:51:33<84:50:20, 196.92s/it]
2%|▊ | 35/1585 [1:54:50<84:46:23, 196.89s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9963920178141794e-05, 'epoch': 0.02}

2%|▊ | 36/1585 [1:58:06<84:40:29, 196.79s/it]
2%|▊ | 37/1585 [2:01:23<84:37:05, 196.79s/it]
2%|▊ | 38/1585 [2:04:39<84:28:56, 196.60s/it]
2%|▊ | 39/1585 [2:07:56<84:28:48, 196.72s/it]
3%|▊ | 40/1585 [2:11:12<84:21:26, 196.56s/it]

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9952881120377314e-05, 'epoch': 0.03}
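
The pattern above (loss blowing up to ~1.2e15 at step 5 with grad_norm = nan, then loss pinned at 0.0 while grad_norm stays nan) is consistent with a single bf16 gradient overflow whose NaNs then propagate through the optimizer state, after which every update is garbage. A minimal sketch of a guard that stops the run at the first NaN gradient norm, so it doesn't burn 80+ GPU-hours writing broken checkpoints (an illustrative helper built on transformers' TrainerCallback, not something LLaMA-Factory ships):

import math

from transformers import TrainerCallback


class AbortOnNanCallback(TrainerCallback):
    """Stop training as soon as a NaN gradient norm appears in the logs."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        grad_norm = logs.get("grad_norm")
        if grad_norm is not None and math.isnan(float(grad_norm)):
            # Once grad_norm is NaN, the weights are already poisoned;
            # a loss that collapses to 0.0 afterwards is the same failure.
            print(f"NaN grad_norm at step {state.global_step}; stopping training.")
            control.should_training_stop = True

It can be attached with trainer.add_callback(AbortOnNanCallback()) before trainer.train(). For the overflow itself, the usual first mitigations are a lower learning rate and tighter gradient clipping (the HF max_grad_norm argument).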

Expected behavior

No response

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label on Jun 28, 2024