Software environment:
- paddlepaddle-gpu: 0.0.0.post120
- paddlenlp: 2.8.0
Error description:

Error 1:

```
Traceback (most recent call last):
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 630, in <module>
    main()
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 808, in train
    return self._inner_training_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1192, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1476, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 346, in evaluate
    output = eval_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 2872, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3091, in prediction_step
    return self.prediction_pipeline_step(model, inputs, prediction_loss_only, ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3050, in prediction_pipeline_step
    loss = model.eval_batch([inputs, labels], compute_loss=True)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 749, in eval_batch
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 798, in _forward_step
    output_tensor = self._layers.forward(input_tensor, chunk_id=chunk_id)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py", line 809, in forward
    input = self.forward_function(0, len(self.run_function))(input)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py", line 785, in execute_func
    x = layer(x)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling_pp.py", line 116, in forward
    input_ids, attention_mask, position_ids, alibi = parse_args(args)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling_pp.py", line 62, in parse_args
    if position_ids is not None:
UnboundLocalError: local variable 'position_ids' referenced before assignment
```

Error 2:

```
Traceback (most recent call last):
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 630, in <module>
    main()
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 608, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 808, in train
    return self._inner_training_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1192, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, epoch, ignore_keys_for_eval, inputs=inputs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 1476, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/LAB/huangjx/PaddleNLP-develop/PaddleNLP-develop/llm/run_pretrain.py", line 346, in evaluate
    output = eval_loop(
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 2872, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3091, in prediction_step
    return self.prediction_pipeline_step(model, inputs, prediction_loss_only, ignore_keys)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/trainer/trainer.py", line 3050, in prediction_pipeline_step
    loss = model.eval_batch([inputs, labels], compute_loss=True)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 767, in eval_batch
    output_tensor = self._forward_step(input_tensor, micro_dataset)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/distributed/fleet/meta_parallel/pipeline_parallel.py", line 809, in _forward_step
    loss_tensor = loss_fn(output_tensor, labels)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/LAB/huangjx/.local/lib/python3.10/site-packages/paddlenlp/transformers/llama/modeling.py", line 1651, in forward
    masked_lm_loss = self.loss_func(prediction_scores.astype("float32"), masked_lm_labels.unsqueeze(2))
AttributeError: 'tuple' object has no attribute 'unsqueeze'
```
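For reference, error 1 is the standard Python failure mode where a local variable is assigned on only some branches of a function and then read unconditionally. The sketch below is a simplified, hypothetical reduction of `parse_args` in `modeling_pp.py` (not the actual PaddleNLP source); it reproduces the same `UnboundLocalError` when a pipeline stage receives fewer inputs than the four the function expects:

```python
# Hypothetical reduction of parse_args (simplified; not the real PaddleNLP code).
# position_ids is only bound when four positional inputs arrive, but it is read
# unconditionally afterwards -- the same failure mode as error 1.
def parse_args(args):
    if len(args) == 4:
        input_ids, attention_mask, position_ids, alibi = args
    elif len(args) == 2:
        input_ids, attention_mask = args
        # position_ids and alibi are never bound on this branch

    if position_ids is not None:  # UnboundLocalError on the two-input path
        print("position_ids provided")
    return input_ids, attention_mask, position_ids, alibi


parse_args(("ids", "mask"))  # raises: local variable 'position_ids' referenced before assignment
```

The usual fix for this pattern is to initialize `position_ids = None` (and `alibi = None`) on every branch before they are read.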
Stable reproduction steps & code:

Contents of pretrain_llama2.json:

```json
{
  "model_name_or_path": "meta-llama/Llama-2-7b",
  "tokenizer_name_or_path": "meta-llama/Llama-2-7b",
  "input_dir": "./data",
  "output_dir": "./checkpoints/llama2_pretrain_ckpts",
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "per_device_eval_batch_size": 1,
  "tensor_parallel_degree": 2,
  "pipeline_parallel_degree": 2,
  "sharding": "stage1",
  "virtual_pp_degree": 1,
  "sequence_parallel": 0,
  "use_flash_attention": false,
  "use_fused_rms_norm": false,
  "use_fused_rope": false,
  "max_seq_length": 4096,
  "learning_rate": 3e-05,
  "min_learning_rate": 3e-06,
  "warmup_steps": 10,
  "logging_steps": 1,
  "max_steps": 3,
  "save_steps": 3,
  "eval_steps": 3,
  "weight_decay": 0.01,
  "bf16": false,
  "fp16_opt_level": "O2",
  "warmup_ratio": 0.01,
  "max_grad_norm": 1.0,
  "dataloader_num_workers": 1,
  "continue_training": 1,
  "do_train": true,
  "do_eval": true,
  "do_predict": false,
  "disable_tqdm": true,
  "recompute": true,
  "distributed_dataloader": 1,
  "recompute_granularity": "full",
  "save_total_limit": 2
}
```
Hardware: two machines, eight GPUs in total; each machine has 4 V100s.
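As a sanity check, here is the topology this config implies on that cluster, under the assumption (conventional, but not verified from logs) that stage1 sharding takes the GPUs left over after tensor and pipeline parallelism:

```python
# Topology implied by pretrain_llama2.json on 2 nodes x 4 V100s.
# Assumption: sharding degree = total GPUs / (tp * pp), per the usual factorization.
tensor_parallel_degree = 2    # from the config above
pipeline_parallel_degree = 2  # from the config above
total_gpus = 8                # 2 machines x 4 V100s

gpus_per_model_replica = tensor_parallel_degree * pipeline_parallel_degree  # = 4
assert total_gpus % gpus_per_model_replica == 0
sharding_degree = total_gpus // gpus_per_model_replica  # = 2 (stage1 sharding)
print(f"{sharding_degree}-way sharding over {gpus_per_model_replica}-GPU model replicas")
```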
Launch commands:

Machine 1:

```shell
python3 -m paddle.distributed.launch --gpus "0,1,2,3" --master=192.168.5.32:49176 --nnodes=2 run_pretrain.py ./llama/pretrain_llama2.json
```

Machine 2:

```shell
python3 -m paddle.distributed.launch --gpus "0,1,2,3" --master=192.168.5.32:49176 --nnodes=2 run_pretrain.py ./llama/pretrain_llama2.json
```
Did you change the model's inputs? It looks like the data being fed in doesn't match the inputs the model expects.
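Error 2 is consistent with that theory: the loss layer receives the labels wrapped in a tuple where it expects a bare tensor. Below is a hedged sketch of the mismatch; `loss_forward` is a simplified stand-in for the failing line in `modeling.py` (line 1651), not the real PaddleNLP loss layer:

```python
import paddle

def loss_forward(prediction_scores, masked_lm_labels):
    # Mirrors the failing line: labels must be a Tensor so .unsqueeze(2) works,
    # but under pipeline parallelism they arrive wrapped in a tuple.
    return paddle.nn.functional.cross_entropy(
        prediction_scores.astype("float32"),
        masked_lm_labels.unsqueeze(2),
        reduction="none",
    )

labels = paddle.zeros([1, 8], dtype="int64")
scores = paddle.rand([1, 8, 32000])
loss_forward(scores, labels)     # works: labels is a Tensor
loss_forward(scores, (labels,))  # AttributeError: 'tuple' object has no attribute 'unsqueeze'
```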
I didn't change the model inputs. The dataset is the one provided on the official site, placed under ./data.
Also, if I set pipeline_parallel_degree=1, there is no error and training completes successfully.
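That observation suggests a stopgap while the pipeline-parallel path is being fixed: disable pipeline parallelism in the config. A hypothetical helper that applies the change (the field name is from the JSON above; how the freed GPUs refactor into sharding is an assumption):

```python
import json

# Stopgap (assumption, based on the report that pp=1 trains successfully):
# turn off pipeline parallelism and leave tp=2 plus stage1 sharding.
with open("./llama/pretrain_llama2.json") as f:
    cfg = json.load(f)

cfg["pipeline_parallel_degree"] = 1  # was 2

with open("./llama/pretrain_llama2.json", "w") as f:
    json.dump(cfg, f, indent=2)
```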