
Huawei NPU training fails when using the training script from the examples and the official image #4610

Closed
apachemycat opened this issue Jun 28, 2024 · 4 comments
Labels: npu (This problem is related to NPU devices), solved (This problem has been already solved)

Comments

@apachemycat

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.3.dev0
  • Platform: Linux-4.19.36-vhulk1907.1.0.h1438.eulerosv2r8.aarch64-aarch64-with-glibc2.34
  • Python version: 3.9.9
  • PyTorch version: 2.1.0 (NPU)
  • Transformers version: 4.42.0
  • Datasets version: 2.20.0
  • Accelerate version: 0.31.0
  • PEFT version: 0.11.1
  • TRL version: 0.9.4
  • NPU type: Ascend910B
  • CANN version: 8.0.RC2.alpha003

Reproduction

llamafactory-cli train /models/llama-factory-llama3-train/llama3_lora_sft.yaml
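(The YAML file itself is not attached to this issue; for context, a LoRA SFT config in the style of LLaMA-Factory's example llama3_lora_sft.yaml typically looks like the sketch below. The model path, dataset names, and hyperparameter values are illustrative placeholders, not taken from this report.)

# model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

# method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

# dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000

# output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500

# train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
bf16: true

Running the command above then fails with: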

RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:05:55.631.066 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.

[ERROR] 2024-06-28-13:05:55 (PID:11591, Device:0, RankID:-1) ERR01005 OPS internal error

Expected behavior

Training runs normally.

Others

No response

@github-actions github-actions bot added pending This problem is yet to be addressed npu This problem is related to NPU devices labels Jun 28, 2024
@apachemycat
Author

06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
06/28/2024 13:24:49 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,v_proj,down_proj,up_proj,q_proj,k_proj,gate_proj
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in
sys.exit(main())
File "/LLaMA-Factory/src/llamafactory/cli.py", line 111, in main
run_exp()
File "/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
File "/LLaMA-Factory/src/llamafactory/model/loader.py", line 161, in load_model
model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
File "/LLaMA-Factory/src/llamafactory/model/adapter.py", line 310, in init_adapter
model = _setup_lora_tuning(
File "/LLaMA-Factory/src/llamafactory/model/adapter.py", line 265, in _setup_lora_tuning
param.data = param.data.to(torch.float32)
RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:24:50.334.697 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.
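
The failing statement is a plain fp16-to-fp32 cast on the NPU, so the same failure should be reproducible outside LLaMA-Factory. A minimal check is sketched below (an assumption of mine, not from this thread; it requires torch_npu installed and device npu:0 visible):

# Minimal cast check: if this also raises "call aclnnCast failed",
# the problem is in the CANN/driver stack rather than in LLaMA-Factory.
import torch
import torch_npu  # noqa: F401 -- registers the "npu" device with PyTorch

x = torch.randn(4, 4).to("npu:0").to(torch.float16)
y = x.to(torch.float32)  # the same Cast op as param.data.to(torch.float32)
print(y.dtype, y.device)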

@MengqingCao
Contributor

Which image are you using? This problem is most likely caused by a missing operator (kernel) package.

@shink
shink commented Jun 29, 2024

Could you share the image information? Thanks.

@apachemycat
Author

Thanks, it is solved now. I asked Huawei's technical staff: drivers from the npu-smi 23.x RC series cannot be used with CANN version 8.0.*.
After upgrading to npu-smi 24.1.rc1, the driver is compatible with CANN 8.0.RC.
With that driver, the base image of the NPU image you provide needs to be changed to FROM cosdt/cann:8.0.rc1-910-openeuler22.03,
which then matches the NPU driver.
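
(For anyone hitting the same mismatch, the rough recipe is to verify the driver version and then rebuild the image on the matching CANN base. The commands below are an illustrative sketch; the Dockerfile path and image tag are assumptions on my part, only the cosdt/cann base image comes from this thread.)

# Check the installed Ascend driver / npu-smi version
# (should be 24.1.rc1 or newer before using CANN 8.0.RC*)
npu-smi info

# In the NPU Dockerfile shipped with LLaMA-Factory, switch the base image, e.g.
#   FROM cosdt/cann:8.0.rc1-910-openeuler22.03
# then rebuild (path and tag are assumptions):
docker build -f docker/docker-npu/Dockerfile -t llamafactory:npu .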

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Jun 30, 2024
@hiyouga hiyouga closed this as completed Jun 30, 2024