
[Bug]: AttributeError and UnboundLocalError when running llama2-7b pretraining with PaddleNLP (pipeline_parallel_degree=2, tensor_parallel_degree=2) #8579

Open
hjx620 opened this issue Jun 11, 2024 · 3 comments
Labels: bug (Something isn't working)
hjx620 commented Jun 11, 2024

Software environment

- paddlepaddle-gpu: 0.0.0.post120
- paddlenlp: 2.8.0

Duplicate check

  • I have searched the existing issues

Error description

Error 1:
Traceback (most recent call last):
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 630, in <module>
    main()
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 808, in train
    return self._inner_training_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1192, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1476, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 346, in evaluate
    output = eval_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 2872, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3091, in prediction_step
    return self.prediction_pipeline_step(model, inputs, prediction_loss_only, ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3050, in prediction_pipeline_step
    loss = model.eval_batch([inputs, labels], compute_loss=True)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 749, in eval_batch
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 798, in _forward_step
    output_tensor = self._layers.forward(input_tensor, chunk_id=chunk_id)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py", line 809, in forward
    input = self.forward_function(0, len(self.run_function))(input)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py", line 785, in execute_func
    x = layer(x)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling_pp.py", line 116, in forward
    input_ids, attention_mask, position_ids, alibi = parse_args(args)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling_pp.py", line 62, in parse_args
    if position_ids is not None:
UnboundLocalError: local variable 'position_ids' referenced before assignment
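
For context, this kind of UnboundLocalError appears when parse_args in modeling_pp.py unpacks the incoming pipeline-stage args by tuple length and receives a length it has no branch for. A minimal sketch of the failure pattern (illustrative only, not the exact PaddleNLP source):

def parse_args(args):
    # Sketch: unpack pipeline-stage inputs by tuple length. If `args`
    # arrives with an unexpected number of elements (for example extra
    # items appended by the eval path), no branch runs and
    # `position_ids` is never bound.
    if isinstance(args, tuple):
        if len(args) == 4:
            hidden_states, attention_mask, position_ids, alibi = args
        elif len(args) == 3:
            hidden_states, attention_mask, position_ids = args
            alibi = None
        elif len(args) == 2:
            hidden_states, attention_mask = args
            position_ids, alibi = None, None
        # an unhandled tuple length falls through with nothing assigned
    else:
        hidden_states = args
        attention_mask, position_ids, alibi = None, None, None

    if position_ids is not None:  # UnboundLocalError raised here
        position_ids.stop_gradient = True
    return hidden_states, attention_mask, position_ids, alibi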

Error 2:
Traceback (most recent call last):
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 630, in <module>
    main()
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 808, in train
    return self._inner_training_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1192, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1476, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 346, in evaluate
    output = eval_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 2872, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3091, in prediction_step
    return self.prediction_pipeline_step(model, inputs, prediction_loss_only, ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3050, in prediction_pipeline_step
    loss = model.eval_batch([inputs, labels], compute_loss=True)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 767, in eval_batch
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 809, in _forward_step
    loss_tensor = loss_fn(output_tensor, labels)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling.py", line 1651, in forward
    masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2))
AttributeError: 'tuple' object has no attribute 'unsqueeze'
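
The second error is the same input mismatch seen from the loss side: in the eval path the labels reach the pipeline loss function still wrapped in a tuple, so calling a tensor method on them fails. A minimal illustration of the failure (not a proposed patch):

import paddle

# The eval path hands the loss function a tuple-wrapped tensor
# instead of a bare tensor.
masked_lm_labels = (paddle.to_tensor([[1, 2, 3]]),)

# masked_lm_labels.unsqueeze(2)   # AttributeError: 'tuple' object has no attribute 'unsqueeze'
masked_lm_labels[0].unsqueeze(2)  # works once the tensor is unwrapped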

Steps to reproduce & code

Contents of pretrain_llama2.json:
{
"model_name_or_path": "meta-llama/Llama-2-7b",
"tokenizer_name_or_path": "meta-llama/Llama-2-7b",
"input_dir": "./data",
"output_dir": "./checkpoints/llama2_pretrain_ckpts",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 1,
"per_device_eval_batch_size": 1,
"tensor_parallel_degree": 2,
"pipeline_parallel_degree": 2,
"sharding": "stage1",
"virtual_pp_degree": 1,
"sequence_parallel": 0,
"use_flash_attention": false,
"use_fused_rms_norm": false,
"use_fused_rope": false,
"max_seq_length": 4096,
"learning_rate": 3e-05,
"min_learning_rate": 3e-06,
"warmup_steps": 10,
"logging_steps": 1,
"max_steps": 3,
"save_steps": 3,
"eval_steps": 3,
"weight_decay": 0.01,
"bf16": false,
"fp16_opt_level": "O2",
"warmup_ratio": 0.01,
"max_grad_norm": 1.0,
"dataloader_num_workers": 1,
"continue_training": 1,
"do_train": true,
"do_eval": true,
"do_predict": false,
"disable_tqdm": true,
"recompute": true,
"distributed_dataloader": 1,
"recompute_granularity": "full",
"save_total_limit": 2
}

Using two machines with eight GPUs in total; each machine has 4 V100s.

Launch commands:
Machine 1:
python3 -m paddle.distributed.launch --gpus "0,1,2,3" --master=192.168.5.32:49176 --nnodes=2 run_pretrain.py ./llama/pretrain_llama2.json
Machine 2:
python3 -m paddle.distributed.launch --gpus "0,1,2,3" --master=192.168.5.32:49176 --nnodes=2 run_pretrain.py ./llama/pretrain_llama2.json
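
For reference, the parallel degrees in this config have to divide the 8-rank world size; a quick sanity check (illustrative):

# 2 nodes x 4 GPUs = 8 ranks in total.
# tensor_parallel_degree=2 and pipeline_parallel_degree=2 leave
# 8 // (2 * 2) = 2 ranks for the sharding (stage1) / data-parallel axis.
world_size = 2 * 4
tp, pp = 2, 2
assert world_size % (tp * pp) == 0
print(world_size // (tp * pp))  # -> 2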

hjx620 added the bug (Something isn't working) label Jun 11, 2024
ZHUI (Collaborator) commented Jun 11, 2024

Did you change the model inputs? It looks like the input data doesn't match what the model expects.

hjx620 (Author) commented Jun 11, 2024

I didn't change the model inputs. The dataset is the one provided on the official site, placed under ./data.
[screenshot attached]

hjx620 (Author) commented Jun 11, 2024

Also, if I set pipeline_parallel_degree=1, there is no error and training runs successfully.
