
[trainer] param count for deepspeed zero3 #22193

Merged · stas00 merged 1 commit into main from ds-trainable-params on Mar 17, 2023

Conversation

stas00 (Contributor) commented on Mar 16, 2023

As reported in #22179, the trainer code doesn't handle sharded models correctly when reporting the "Number of trainable parameters" - I'm not sure if FSDP models have the same issue.

This PR fixes the situation for DeepSpeed ZeRO-3, which otherwise reported a count of 0.

Fixes: #22179
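
For reference, the gist of the counting logic is roughly the following (a sketch, not the exact diff; the helper name is illustrative, and it relies on the `ds_numel` attribute DeepSpeed attaches to ZeRO-3 partitioned parameters):

```python
import torch.nn as nn


def get_model_param_count(model: nn.Module, trainable_only: bool = False) -> int:
    """Count parameters, reading DeepSpeed's recorded size for ZeRO-3 placeholders."""

    def numel(p: nn.Parameter) -> int:
        # Under ZeRO-3 the local tensor is a 0-element placeholder; DeepSpeed
        # stores the true element count on the `ds_numel` attribute.
        return p.ds_numel if hasattr(p, "ds_numel") else p.numel()

    return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad)
```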

HuggingFaceDocBuilderDev commented on Mar 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@stas00 stas00 marked this pull request as ready for review March 16, 2023 21:07
@stas00 stas00 requested a review from sgugger March 16, 2023 21:07
sgugger (Collaborator) left a comment

Let's hope FSDP used the PyTorch API instead of also inventing their own... Thanks for the fix!

stas00 (Contributor, Author) commented on Mar 17, 2023

An explanation is needed here.

The DeepSpeed team had to invent their own tensor substitute, since two years ago nothing of the kind existed in PyTorch: to support sharded tensors they replace the real tensors with placeholders.

Meta tensors were added to PyTorch only recently, so they are looking at possibly switching to those.

The API I used in this PR is not public per se. The "clean" way would be to gather the tensors and then call their normal t.numel() - but that is extremely wasteful and expensive in both memory and time. So I hacked it to read the internal equivalent, which makes the count almost instant.
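
For contrast, the "clean" but expensive version would look something like this (a sketch using DeepSpeed's GatheredParameters context manager, not the code in this PR):

```python
import deepspeed
import torch.nn as nn


def gathered_param_count(model: nn.Module) -> int:
    """The 'clean' but wasteful way: materialize each sharded parameter, then count it."""
    total = 0
    for p in model.parameters():
        # All-gathers the full tensor onto every rank just to read its size,
        # paying communication and temporary memory for each parameter.
        with deepspeed.zero.GatheredParameters(p, modifier_rank=None):
            total += p.numel()
    return total
```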

I'm not planning on leaving it this way; I'm asking DeepSpeed to provide an efficient method to return the sizes without me using a non-public API.

There are many other hidden issues w.r.t. this tensor substitution that impact only ZeRO stage 3 (microsoft/DeepSpeed#2650) - and yesterday, while debugging the user report that led to this PR, I discovered at least one bug in our examples because of it. All examples resize the embedding under ZeRO-3, because their check of whether the vocab is larger than the embedding size always returns True - the embedding is reported to be of size 0, since it's not gathered :(
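
To illustrate the failure mode, a hypothetical minimal version of the examples' logic; the gather shown is one possible way to read the true size, not necessarily the fix that will land:

```python
import deepspeed
from transformers import PreTrainedModel, PreTrainedTokenizerBase


def maybe_resize_embeddings(model: PreTrainedModel, tokenizer: PreTrainedTokenizerBase) -> None:
    weight = model.get_input_embeddings().weight
    # Reading `weight.shape[0]` directly is buggy under ZeRO-3: the local tensor
    # is a 0-sized placeholder, so `len(tokenizer) > embedding_size` is always True.
    # Gathering the sharded weight first yields the real vocab dimension.
    with deepspeed.zero.GatheredParameters(weight, modifier_rank=None):
        embedding_size = weight.shape[0]
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))
```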

I'm working on ensuring that DeepSpeed addresses this issue, because it's subtle and very problematic.

Please let me know if you're OK with merging this now that you know more details. I can also easily recode it to gather tensors first, but it'd be very inefficient.

@stas00 stas00 requested a review from sgugger March 17, 2023 16:15
sgugger (Collaborator) left a comment

Fine to merge, thanks a lot for the detailed explanation!

@stas00 stas00 merged commit 60d51ef into main Mar 17, 2023
@stas00 stas00 deleted the ds-trainable-params branch March 17, 2023 18:02
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
Successfully merging this pull request may close these issues.

When I use Trainer with Deepspeed, the Number of trainable parameters is 0