Training fails on Huawei NPU using the example training script and the official image #4610
Comments
06/28/2024 13:24:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
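Which image are you using? This problem is most likely caused by a missing operator package.
Please provide the image information, thanks.
Thanks a lot.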
Reminder
System Info
llamafactory
version: 0.8.3.dev0
Reproduction
llamafactory-cli train /models/llama-factory-llama3-train/llama3_lora_sft.yaml
RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:05:55.631.066 Parse dynamic kernel config fail.
TraceBack (most recent call last):
AclOpKernelInit failed opType
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.
[ERROR] 2024-06-28-13:05:55 (PID:11591, Device:0, RankID:-1) ERR01005 OPS internal error
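For triage, the key fields of the traceback above can be pulled out with a few regular expressions. The following is only a minimal sketch: `summarize_acl_error` is a hypothetical helper, not part of LLaMA-Factory, torch_npu, or CANN, and the patterns are assumptions based solely on the wording of this one log.

```python
import re

# Error text copied from the traceback in this issue.
LOG = """RuntimeError: call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-06-28-13:05:55.631.066 Parse dynamic kernel config fail.
Op Cast does not has any binary.
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.
"""


def summarize_acl_error(log: str) -> dict:
    """Extract the failing API, the op lacking a binary, and the errno.

    Hypothetical helper; the regexes assume the message wording seen above.
    """
    api = re.search(r"call (aclnn\w+) failed", log)
    missing = re.search(r"Op (\w+) does not has any binary", log)
    errno = re.search(r"errno:(\d+)", log)
    return {
        "api": api.group(1) if api else None,
        "op_without_binary": missing.group(1) if missing else None,
        "errno": int(errno.group(1)) if errno else None,
    }


summary = summarize_acl_error(LOG)
print(summary)
```

A non-empty `op_without_binary` field means the runtime found no prebuilt kernel binary for that operator, which is consistent with the missing operator package suggested in the comments.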
Expected behavior
Training runs normally.
Others
No response