System Info
8× A800 GPUs, using the libraries listed in the project's requirements.
Reproduction
deepspeed --master_port=9903 --num_gpus 8 src/train.py \
    --deepspeed ds_zero2_no_offload.json \
    --stage pt \
    --do_train \
    --model_name_or_path /gemini/data-2/glm-4-9b/ \
    --dataset mixed \
    --finetuning_type full \
    --output_dir /model_output/lc_output/xunzi_glm4_9b \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --save_steps 1000 \
    --learning_rate 3e-5 \
    --num_train_epochs 1.0 \
    --ddp_timeout 300000000 \
    --plot_loss \
    --overwrite_output_dir \
    --overwrite_cache \
    --cache_dir /model_output/lc_lm_data/xunzi_glm4_dataset_cache/ \
    --cutoff_len 2048 \
    --preprocessing_num_workers 64 \
    --bf16 > logguwen_glm4_9b_full.log 2>&1 & echo $! > run.pid
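
The ds_zero2_no_offload.json referenced above is not reproduced in this report. For context, a typical ZeRO-2 configuration without offloading looks roughly like the sketch below (an assumption: the actual file in use may differ). The "auto" entries are filled in by the HF Trainer from the command-line flags, including --bf16:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "contiguous_gradients": true
  }
}
```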
[INFO|trainer.py:2078] 2024-06-27 19:27:04,713 >> ***** Running training *****
[INFO|trainer.py:2079] 2024-06-27 19:27:04,713 >> Num examples = 1,623,370
[INFO|trainer.py:2080] 2024-06-27 19:27:04,713 >> Num Epochs = 1
[INFO|trainer.py:2081] 2024-06-27 19:27:04,713 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2084] 2024-06-27 19:27:04,713 >> Total train batch size (w. parallel, distributed & accumulation) = 1,024
[INFO|trainer.py:2085] 2024-06-27 19:27:04,713 >> Gradient Accumulation steps = 32
[INFO|trainer.py:2086] 2024-06-27 19:27:04,713 >> Total optimization steps = 1,585
[INFO|trainer.py:2087] 2024-06-27 19:27:04,714 >> Number of trainable parameters = 9,399,951,360
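
The trainer's numbers are consistent with the flags above. A quick sanity check of the arithmetic (assuming the 8 GPUs from the system info):

```python
# Effective batch size and step count implied by the command-line flags.
per_device_bs = 4        # --per_device_train_batch_size
grad_accum = 32          # --gradient_accumulation_steps
num_gpus = 8             # from the system info above

total_bs = per_device_bs * grad_accum * num_gpus
print(total_bs)                    # 1024, matching "Total train batch size"

num_examples = 1_623_370
print(num_examples // total_bs)    # 1585, matching "Total optimization steps"
```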
0%| | 0/1585 [00:00<?, ?it/s]
0%| | 1/1585 [03:23<89:39:39, 203.77s/it]
0%| | 2/1585 [06:40<87:46:42, 199.62s/it]
0%| | 3/1585 [09:57<87:06:39, 198.23s/it]
0%| | 4/1585 [13:13<86:40:02, 197.34s/it]
0%| | 5/1585 [16:29<86:29:17, 197.06s/it]
{'loss': 1216724077012582.5, 'grad_norm': nan, 'learning_rate': 2.9999263387762184e-05, 'epoch': 0.0}
0%|▏ | 6/1585 [19:46<86:22:26, 196.93s/it]
0%|▏ | 7/1585 [23:03<86:18:03, 196.88s/it]
1%|▏ | 8/1585 [26:19<86:12:04, 196.78s/it]
1%|▏ | 9/1585 [29:36<86:09:50, 196.82s/it]
1%|▏ | 10/1585 [32:53<86:04:39, 196.75s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.999705362339508e-05, 'epoch': 0.01}
1%|▏ | 11/1585 [36:09<85:54:36, 196.49s/it]
1%|▎ | 12/1585 [39:24<85:46:03, 196.29s/it]
1%|▎ | 13/1585 [42:41<85:44:46, 196.37s/it]
1%|▎ | 14/1585 [45:57<85:39:01, 196.27s/it]
1%|▎ | 15/1585 [49:13<85:33:18, 196.18s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9993370923930618e-05, 'epoch': 0.01}
1%|▎ | 16/1585 [52:29<85:30:31, 196.20s/it]
1%|▍ | 17/1585 [55:47<85:36:36, 196.55s/it]
1%|▍ | 18/1585 [59:03<85:35:07, 196.62s/it]
1%|▍ | 19/1585 [1:02:21<85:40:27, 196.95s/it]
1%|▍ | 20/1585 [1:05:38<85:34:45, 196.86s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9988215651065e-05, 'epoch': 0.01}
1%|▍ | 21/1585 [1:08:54<85:30:34, 196.83s/it]
1%|▍ | 22/1585 [1:12:11<85:24:27, 196.72s/it]
1%|▍ | 23/1585 [1:15:27<85:19:02, 196.63s/it]
2%|▌ | 24/1585 [1:18:44<85:14:31, 196.59s/it]
2%|▌ | 25/1585 [1:22:00<85:11:14, 196.59s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9981588311123172e-05, 'epoch': 0.02}
2%|▌ | 26/1585 [1:25:18<85:13:16, 196.79s/it]
2%|▌ | 27/1585 [1:28:35<85:11:20, 196.84s/it]
2%|▌ | 28/1585 [1:31:52<85:09:44, 196.91s/it]
2%|▌ | 29/1585 [1:35:08<85:02:41, 196.76s/it]
2%|▋ | 30/1585 [1:38:25<84:59:39, 196.77s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9973489555009086e-05, 'epoch': 0.02}
2%|▋ | 31/1585 [1:41:42<84:56:43, 196.78s/it]
2%|▋ | 32/1585 [1:44:59<84:56:06, 196.89s/it]
2%|▋ | 33/1585 [1:48:16<84:57:25, 197.07s/it]
2%|▋ | 34/1585 [1:51:33<84:50:20, 196.92s/it]
2%|▊ | 35/1585 [1:54:50<84:46:23, 196.89s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9963920178141794e-05, 'epoch': 0.02}
2%|▊ | 36/1585 [1:58:06<84:40:29, 196.79s/it]
2%|▊ | 37/1585 [2:01:23<84:37:05, 196.79s/it]
2%|▊ | 38/1585 [2:04:39<84:28:56, 196.60s/it]
2%|▊ | 39/1585 [2:07:56<84:28:48, 196.72s/it]
3%|▊ | 40/1585 [2:11:12<84:21:26, 196.56s/it]
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.9952881120377314e-05, 'epoch': 0.03}
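
The loss explodes at the first logged step and then collapses to 0.0 with grad_norm = nan, which points to the weights going non-finite almost immediately. A minimal sketch for checking this directly (the model path is taken from the command above; trust_remote_code is required for GLM-4, and this is only a suggested diagnostic, not part of the training run):

```python
# Scan model weights for non-finite values (nan/inf). Run this on the
# base weights and, once one exists, on a saved checkpoint to tell a
# pre-existing problem apart from training-time divergence.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/gemini/data-2/glm-4-9b/",   # path from the command above
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

bad = [name for name, p in model.named_parameters()
       if not torch.isfinite(p).all()]
print(bad or "all parameters are finite")
```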
Expected behavior
No response
Others
No response