

model.named_parameters() giving tensors of shape 0 with DeepSpeed CPU offloading #83

Closed
karthikmurugadoss opened this issue Feb 13, 2023 · 3 comments

karthikmurugadoss commented Feb 13, 2023

When I run python examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py, model.print_trainable_parameters() (line 219) prints the following message, as expected:

trainable params: 3932160 || all params: 7072948224 || trainable%: 0.055594355783029126

However, when I follow the instructions in the README and set up Accelerate with DeepSpeed CPU offloading, the same line outputs the following:

trainable params: 3932160 || all params: 3932160 || trainable%: 100.0

Digging a little deeper, it looks like model.named_parameters() returns tensors of size 0 (except for the LoRA A and B matrices) when running with Accelerate and DeepSpeed CPU offloading.

This is the original output of model.named_parameters(), showing only the first few parameters:

base_model.model.transformer.word_embeddings.weight     torch.Size([250880, 4096])
base_model.model.transformer.word_embeddings_layernorm.weight   torch.Size([4096])
base_model.model.transformer.word_embeddings_layernorm.bias     torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.bias   torch.Size([4096])
base_model.model.transformer.h.0.self_attention.query_key_value.weight  torch.Size([12288, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.bias    torch.Size([12288])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight   torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight   torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight    torch.Size([4096, 4096])
base_model.model.transformer.h.0.self_attention.dense.bias      torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.weight        torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.bias  torch.Size([4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight       torch.Size([16384, 4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([16384])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight       torch.Size([4096, 16384])

This is the output when running with Accelerate and DeepSpeed CPU offloading:

base_model.model.transformer.word_embeddings.weight     torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.weight   torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.bias     torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.bias   torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.weight  torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.bias    torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight   torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight   torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight    torch.Size([0])
base_model.model.transformer.h.0.self_attention.dense.bias      torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.weight        torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.bias  torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight       torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight       torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.bias torch.Size([0])

It doesn't look like this impacts fine-tuning, but I was curious why this is happening!

pacman100 (Contributor) commented

Hello @karthikmurugadoss, yes, model.print_trainable_parameters() is currently a bit confusing when using DeepSpeed ZeRO-3: the params are flattened and sharded across the GPUs, which is why you see shape-0 tensors. Please refer to https://huggingface.co/docs/transformers/main_classes/deepspeed#gathering-parameters for more information.
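
For reference, a minimal sketch of two ways to see the real numbers, assuming the standard DeepSpeed ZeRO-3 APIs (deepspeed.zero.GatheredParameters, and the ds_numel attribute that DeepSpeed attaches to partitioned params); model here stands in for the PEFT model from the example script:

import deepspeed

# Under ZeRO-3, each rank holds only a flattened shard; the local tensor is
# a shape-[0] placeholder. GatheredParameters temporarily all-gathers a
# param so its full shape is visible inside the context.
for name, param in model.named_parameters():
    with deepspeed.zero.GatheredParameters(param):
        print(name, param.shape)

# Alternatively, partitioned params carry their true element count in
# ds_numel, so totals can be computed without gathering anything:
total = sum(
    p.ds_numel if hasattr(p, "ds_numel") else p.numel()
    for p in model.parameters()
)
print(f"all params: {total}")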

karthikmurugadoss (Author) commented

Got it, thanks @pacman100!

dumpmemory (Contributor) commented

Same here, but the shape-0 tensor surfaces at

in_features, out_features = target.weight.shape

The shape is 0 with a GPT-2 model under DeepSpeed init.
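
A possible guard, sketched here under the assumption that DeepSpeed records the full shape of each partitioned param in a ds_shape attribute (GPT-2 uses Conv1D, whose weight is stored as (in_features, out_features)):

# The local weight is empty under ZeRO-3, but ds_shape still holds the
# full dimensions; the hasattr guard keeps this working without DeepSpeed.
weight = target.weight
full_shape = weight.ds_shape if hasattr(weight, "ds_shape") else weight.shape
in_features, out_features = full_shape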
