The weights saved during training using deepspeed zero stage3 are incomplete. #453
Same here: when using stage 3 without offload, it only saves a null checkpoint (128K). When I switch to ZeRO stage 2, the saved model is complete (77M).
Instead of saving the adapter checkpoint with the callback, I managed to save the full model weights by overwriting
This worked in my case; the trainer can save the full adapter model under ZeRO-3.
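The commenter's actual override is not shown in the thread, but the pattern commonly used for this (e.g. in FastChat's own training code) is to gather each ZeRO-3 partitioned parameter before copying it to CPU. The sketch below is illustrative, not the commenter's code; the function names `maybe_zero_3` and `get_lora_state_dict` are placeholders.

```python
import torch

try:
    # DeepSpeed is only needed when training actually runs under ZeRO-3.
    from deepspeed import zero
    from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
    HAS_DEEPSPEED = True
except ImportError:
    HAS_DEEPSPEED = False


def maybe_zero_3(param: torch.Tensor) -> torch.Tensor:
    """Materialize a (possibly ZeRO-3 partitioned) parameter on CPU."""
    if HAS_DEEPSPEED and hasattr(param, "ds_id"):
        # Under ZeRO-3 each rank holds only a shard of the tensor;
        # GatheredParameters temporarily reassembles the full tensor.
        assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE
        with zero.GatheredParameters([param]):
            return param.data.detach().cpu().clone()
    return param.detach().cpu().clone()


def get_lora_state_dict(model: torch.nn.Module) -> dict:
    """Collect only the LoRA weights, gathering shards where necessary."""
    return {
        name: maybe_zero_3(p)
        for name, p in model.named_parameters()
        if "lora_" in name
    }
```

On rank 0 the resulting dict can then be written out with `torch.save(state_dict, "adapter_model.bin")`, which avoids the near-empty checkpoint produced when the partitioned parameters are saved directly.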
@tengxiaoliu Can these adapter model weights later be merged with the following code, or should I use the lora_model directly for inference?
@Nomiizz Yes, the saved adapter should be merged with the base model for inference.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Not working in my case. I also use ZeRO-3 + LoRA, but the adapter_model.bin is 3.2 MB, and at inference time loading it fails with `size mismatch: copying a param with shape torch.Size([0]) from checkpoint`. Any suggestions?
Solved by the following code:
I used the train_lora.py code provided by FastChat and trained with DeepSpeed ZeRO stage 3 without CPU offload. However, the saved .bin weight file after training was only 3.6 MB and could not be loaded. With non-stage-3 training, the saved weights are 16 MB and can be loaded for inference.
deepspeed stage3.json
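The attached stage-3 config is not reproduced in this scrape. For reference, a minimal ZeRO-3 fragment of the kind involved here might look like the following (values are illustrative); the `stage3_gather_16bit_weights_on_model_save` flag is the setting most often implicated in empty or tiny checkpoints, since without it DeepSpeed does not consolidate the partitioned weights at save time.

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```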
train_lora.py