
[trainer] param count for deepspeed zero3 #22193

Merged · stas00 merged 1 commit into main from ds-trainable-params on Mar 17, 2023

Conversation

stas00 (Contributor) commented on Mar 16, 2023

As reported in #22179, the trainer code doesn't handle sharded models correctly when reporting the "Number of trainable parameters" - I'm not sure if FSDP models have the same issue.

This PR fixes the situation for DeepSpeed ZeRO-3, which otherwise reported a count of 0.

Fixes: #22179
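
For reference, the gist of the counting logic is roughly the following (a sketch, not the exact diff; the helper name is illustrative, and it relies on the `ds_numel` attribute DeepSpeed attaches to ZeRO-3 partitioned parameters):

```python
import torch.nn as nn


def get_model_param_count(model: nn.Module, trainable_only: bool = False) -> int:
    """Count parameters, reading DeepSpeed's recorded size for ZeRO-3 placeholders."""

    def numel(p: nn.Parameter) -> int:
        # Under ZeRO-3 the local tensor is a 0-element placeholder; DeepSpeed
        # stores the true element count on the `ds_numel` attribute.
        return p.ds_numel if hasattr(p, "ds_numel") else p.numel()

    return sum(numel(p) for p in model.parameters() if not trainable_only or p.requires_grad)
```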

HuggingFaceDocBuilderDev commented on Mar 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@stas00 stas00 marked this pull request as ready for review March 16, 2023 21:07
@stas00 stas00 requested a review from sgugger March 16, 2023 21:07
sgugger (Collaborator) left a comment

Let's hope FSDP used the PyTorch API instead of also inventing their own... Thanks for the fix!

stas00 (Contributor, Author) commented on Mar 17, 2023

An explanation is needed here.

The DeepSpeed team had to invent their own tensor substitute, since two years ago nothing of the kind existed in PyTorch: to support sharded tensors they replace the real tensors with placeholders.

Meta tensors were added to PyTorch only recently, so they are looking at possibly switching to those.

The API I used in this PR is not public per se. The "clean" way would be to gather the tensors and then call their normal t.numel() - but that is extremely wasteful and expensive in both memory and time. So I hacked it to read the internal equivalent, which makes the count almost instant.
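
For contrast, the "clean" but expensive version would look something like this (a sketch using DeepSpeed's GatheredParameters context manager, not the code in this PR):

```python
import deepspeed
import torch.nn as nn


def gathered_param_count(model: nn.Module) -> int:
    """The 'clean' but wasteful way: materialize each sharded parameter, then count it."""
    total = 0
    for p in model.parameters():
        # All-gathers the full tensor onto every rank just to read its size,
        # paying communication and temporary memory for each parameter.
        with deepspeed.zero.GatheredParameters(p, modifier_rank=None):
            total += p.numel()
    return total
```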

I'm not planning on leaving it this way; I'm asking DeepSpeed to provide an efficient method to return the sizes without me using a non-public API.

There are many other hidden issues w.r.t. this tensor substitution that impact only ZeRO stage 3 (microsoft/DeepSpeed#2650) - and yesterday, while debugging the user report that led to this PR, I discovered at least one bug in our examples because of it. All examples resize the embedding under ZeRO-3, because their check of whether the vocab is larger than the embedding size always returns True - the embedding is reported to be of size 0, since it's not gathered :(
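
To illustrate the failure mode, a hypothetical minimal version of the examples' logic; the gather shown is one possible way to read the true size, not necessarily the fix that will land:

```python
import deepspeed
from transformers import PreTrainedModel, PreTrainedTokenizerBase


def maybe_resize_embeddings(model: PreTrainedModel, tokenizer: PreTrainedTokenizerBase) -> None:
    weight = model.get_input_embeddings().weight
    # Reading `weight.shape[0]` directly is buggy under ZeRO-3: the local tensor
    # is a 0-sized placeholder, so `len(tokenizer) > embedding_size` is always True.
    # Gathering the sharded weight first yields the real vocab dimension.
    with deepspeed.zero.GatheredParameters(weight, modifier_rank=None):
        embedding_size = weight.shape[0]
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))
```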

I'm working on ensuring that DeepSpeed addresses this issue, because it's subtle and very problematic.

Please let me know if you're OK with merging this now that you know more details. I can also easily recode it to gather tensors first, but it'd be very inefficient.

@stas00 stas00 requested a review from sgugger March 17, 2023 16:15
sgugger (Collaborator) left a comment

Fine to merge, thanks a lot for the detailed explanation!

@stas00 stas00 merged commit 60d51ef into main Mar 17, 2023
@stas00 stas00 deleted the ds-trainable-params branch March 17, 2023 18:02
raghavanone pushed a commit to raghavanone/transformers that referenced this pull request Apr 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
Successfully merging this pull request may close these issues.

When I use Trainer with Deepspeed, the Number of trainable parameters is 0