Software environment:
- paddlepaddle-gpu: 0.0.0.post120
- paddlenlp: 2.8.0
Error description:

Error 1:

```
Traceback (most recent call last):
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 630, in <module>
    main()
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 808, in train
    return self._inner_training_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1192, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1476, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 346, in evaluate
    output = eval_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 2872, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3091, in prediction_step
    return self.prediction_pipeline_step(model, inputs, prediction_loss_only, ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3050, in prediction_pipeline_step
    loss = model.eval_batch([inputs, labels], compute_loss=True)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 749, in eval_batch
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 798, in _forward_step
    output_tensor = self._layers.forward(input_tensor, chunk_id=chunk_id)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py", line 809, in forward
    input = self.forward_function(0, len(self.run_function))(input)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py", line 785, in execute_func
    x = layer(x)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling_pp.py", line 116, in forward
    input_ids, attention_mask, position_ids, alibi = parse_args(args)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling_pp.py", line 62, in parse_args
    if position_ids is not None:
UnboundLocalError: local variable 'position_ids' referenced before assignment
```

Error 2:

```
Traceback (most recent call last):
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 630, in <module>
    main()
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 808, in train
    return self._inner_training_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1192, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1476, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 346, in evaluate
    output = eval_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 2872, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3091, in prediction_step
    return self.prediction_pipeline_step(model, inputs, prediction_loss_only, ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3050, in prediction_pipeline_step
    loss = model.eval_batch([inputs, labels], compute_loss=True)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 767, in eval_batch
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 809, in _forward_step
    loss_tensor = loss_fn(output_tensor, labels)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling.py", line 1651, in forward
    masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2))
AttributeError: 'tuple' object has no attribute 'unsqueeze'
```
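For reference, error 1 is the standard Python failure mode where a local variable is assigned on only some branches of a function and then read unconditionally. The sketch below is a simplified, hypothetical reduction of `parse_args` in `modeling_pp.py` (not the actual PaddleNLP source); it reproduces the same `UnboundLocalError` when a pipeline stage receives fewer inputs than the four the function expects:

```python
# Hypothetical reduction of parse_args (simplified; not the real PaddleNLP code).
# position_ids is only bound when four positional inputs arrive, but it is read
# unconditionally afterwards -- the same failure mode as error 1.
def parse_args(args):
    if len(args) == 4:
        input_ids, attention_mask, position_ids, alibi = args
    elif len(args) == 2:
        input_ids, attention_mask = args
        # position_ids and alibi are never bound on this branch

    if position_ids is not None:  # UnboundLocalError on the two-input path
        print("position_ids provided")
    return input_ids, attention_mask, position_ids, alibi


parse_args(("ids", "mask"))  # raises: local variable 'position_ids' referenced before assignment
```

The usual fix for this pattern is to initialize `position_ids = None` (and `alibi = None`) on every branch before they are read.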
Stable reproduction steps & code:

Contents of pretrain_llama2.json:

```json
{
  "model_name_or_path": "meta-llama/Llama-2-7b",
  "tokenizer_name_or_path": "meta-llama/Llama-2-7b",
  "input_dir": "./data",
  "output_dir": "./checkpoints/llama2_pretrain_ckpts",
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "per_device_eval_batch_size": 1,
  "tensor_parallel_degree": 2,
  "pipeline_parallel_degree": 2,
  "sharding": "stage1",
  "virtual_pp_degree": 1,
  "sequence_parallel": 0,
  "use_flash_attention": false,
  "use_fused_rms_norm": false,
  "use_fused_rope": false,
  "max_seq_length": 4096,
  "learning_rate": 3e-05,
  "min_learning_rate": 3e-06,
  "warmup_steps": 10,
  "logging_steps": 1,
  "max_steps": 3,
  "save_steps": 3,
  "eval_steps": 3,
  "weight_decay": 0.01,
  "bf16": false,
  "fp16_opt_level": "O2",
  "warmup_ratio": 0.01,
  "max_grad_norm": 1.0,
  "dataloader_num_workers": 1,
  "continue_training": 1,
  "do_train": true,
  "do_eval": true,
  "do_predict": false,
  "disable_tqdm": true,
  "recompute": true,
  "distributed_dataloader": 1,
  "recompute_granularity": "full",
  "save_total_limit": 2
}
```
Hardware: two machines, eight GPUs in total; each machine has 4 V100s.
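As a sanity check, here is the topology this config implies on that cluster, under the assumption (conventional, but not verified from logs) that stage1 sharding takes the GPUs left over after tensor and pipeline parallelism:

```python
# Topology implied by pretrain_llama2.json on 2 nodes x 4 V100s.
# Assumption: sharding degree = total GPUs / (tp * pp), per the usual factorization.
tensor_parallel_degree = 2    # from the config above
pipeline_parallel_degree = 2  # from the config above
total_gpus = 8                # 2 machines x 4 V100s

gpus_per_model_replica = tensor_parallel_degree * pipeline_parallel_degree  # = 4
assert total_gpus % gpus_per_model_replica == 0
sharding_degree = total_gpus // gpus_per_model_replica  # = 2 (stage1 sharding)
print(f"{sharding_degree}-way sharding over {gpus_per_model_replica}-GPU model replicas")
```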
Launch commands:

Machine 1:

```shell
python3 -m paddle.distributed.launch --gpus "0,1,2,3" --master=192.168.5.32:49176 --nnodes=2 run_pretrain.py ./llama/pretrain_llama2.json
```

Machine 2:

```shell
python3 -m paddle.distributed.launch --gpus "0,1,2,3" --master=192.168.5.32:49176 --nnodes=2 run_pretrain.py ./llama/pretrain_llama2.json
```
Did you change the model's inputs? It looks like the data being fed in doesn't match the inputs the model expects.
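Error 2 is consistent with that theory: the loss layer receives the labels wrapped in a tuple where it expects a bare tensor. Below is a hedged sketch of the mismatch; `loss_forward` is a simplified stand-in for the failing line in `modeling.py` (line 1651), not the real PaddleNLP loss layer:

```python
import paddle

def loss_forward(prediction_scores, masked_lm_labels):
    # Mirrors the failing line: labels must be a Tensor so .unsqueeze(2) works,
    # but under pipeline parallelism they arrive wrapped in a tuple.
    return paddle.nn.functional.cross_entropy(
        prediction_scores.astype("float32"),
        masked_lm_labels.unsqueeze(2),
        reduction="none",
    )

labels = paddle.zeros([1, 8], dtype="int64")
scores = paddle.rand([1, 8, 32000])
loss_forward(scores, labels)     # works: labels is a Tensor
loss_forward(scores, (labels,))  # AttributeError: 'tuple' object has no attribute 'unsqueeze'
```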
I didn't change the model inputs. The dataset is the one provided on the official site, placed under ./data.
Also, if I set pipeline_parallel_degree=1, there is no error and training completes successfully.
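That observation suggests a stopgap while the pipeline-parallel path is being fixed: disable pipeline parallelism in the config. A hypothetical helper that applies the change (the field name is from the JSON above; how the freed GPUs refactor into sharding is an assumption):

```python
import json

# Stopgap (assumption, based on the report that pp=1 trains successfully):
# turn off pipeline parallelism and leave tp=2 plus stage1 sharding.
with open("./llama/pretrain_llama2.json") as f:
    cfg = json.load(f)

cfg["pipeline_parallel_degree"] = 1  # was 2

with open("./llama/pretrain_llama2.json", "w") as f:
    json.dump(cfg, f, indent=2)
```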