
Huawei NPU training fails when using the training script from the examples and the official image #4610

Closed
apachemycat opened this issue Jun 28, 2024 · 4 comments
Labels: npu (This problem is related to NPU devices), solved (This problem has been already solved)

Comments

@apachemycat

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.3.dev0
  • Platform: Linux-4.19.36-vhulk1907.1.0.h1438.eulerosv2r8.aarch64-aarch64-with-glibc2.34
  • Python version: 3.9.9
  • PyTorch version: 2.1.0 (NPU)
  • Transformers version: 4.42.0
  • Datasets version: 2.20.0
  • Accelerate version: 0.31.0
  • PEFT version: 0.11.1
  • TRL version: 0.9.4
  • NPU type: Ascend910B
  • CANN version: 8.0.RC2.alpha003

Reproduction

llamafactory-cli train /models/llama-factory-llama3-train/llama3_lora_sft.yaml
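(The YAML file itself is not attached to this issue; for context, a LoRA SFT config in the style of LLaMA-Factory's example llama3_lora_sft.yaml typically looks like the sketch below. The model path, dataset names, and hyperparameter values are illustrative placeholders, not taken from this report.)

# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

# dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true

Running the command above then fails with: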

RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:05:55.631.066 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.

[ERROR] 2024-06-28-13:05:55 (PID:11591, Device:0, RankID:-1) ERR01005 OPS internal error

Expected behavior

Training runs normally.

Others

No response

@github-actions github-actions bot added pending This problem is yet to be addressed npu This problem is related to NPU devices labels Jun 28, 2024
@apachemycat
Author

06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,v_proj,down_proj,up_proj,q_proj,k_proj,gate_proj
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
File "/LLaMA-Factory/src/llamafactory/model/loader.py", line 161, in load_model
model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
File "/LLaMA-Factory/src/llamafactory/model/adapter.py", line 310, in init_adapter
model = _setup_lora_tuning(
File "/LLaMA-Factory/src/llamafactory/model/adapter.py", line 265, in _setup_lora_tuning
param.data = param.data.to(torch.float32)
RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:24:50.334.697 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.
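
The failing statement is a plain fp16-to-fp32 cast on the NPU, so the same failure should be reproducible outside LLaMA-Factory. A minimal check is sketched below (an assumption of mine, not from this thread; it requires torch_npu installed and device npu:0 visible):

# Minimal cast check: if this also raises "call aclnnCast failed",
# the problem is in the CANN/driver stack rather than in LLaMA-Factory.
import torch
import torch_npu  # noqa: F401 -- registers the "npu" device with PyTorch

x = torch.randn(4, 4).to("npu:0").to(torch.float16)
y = x.to(torch.float32)  # the same Cast op as param.data.to(torch.float32)
print(y.dtype, y.device)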

@MengqingCao
Contributor

Which image are you using? This problem is most likely caused by a missing operator (kernel) package.

@shink
shink commented Jun 29, 2024

Could you share the image information? Thanks.

@apachemycat
Author

Thanks, it is solved now. I asked Huawei's technical staff: drivers from the npu-smi 23.x RC series cannot be used with CANN version 8.0.*.
After upgrading to npu-smi 24.1.rc1, the driver is compatible with CANN 8.0.RC.
With that driver, the base image of the NPU image you provide needs to be changed to FROM cosdt/cann:8.0.rc1-910-openeuler22.03,
which then matches the NPU driver.
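
(For anyone hitting the same mismatch, the rough recipe is to verify the driver version and then rebuild the image on the matching CANN base. The commands below are an illustrative sketch; the Dockerfile path and image tag are assumptions on my part, only the cosdt/cann base image comes from this thread.)

# Check the installed Ascend driver / npu-smi version
# (should be 24.1.rc1 or newer before using CANN 8.0.RC*)
npu-smi info

# In the NPU Dockerfile shipped with LLaMA-Factory, switch the base image, e.g.
#   FROM cosdt/cann:8.0.rc1-910-openeuler22.03
# then rebuild (path and tag are assumptions):
docker build -f docker/docker-npu/Dockerfile -t llamafactory:npu .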

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 30, 2024
@hiyouga hiyouga closed this as completed Jun 30, 2024