

model.named_parameters() giving tensors of shape 0 with DeepSpeed CPU offloading #83

Closed
karthikmurugadoss opened this issue Feb 13, 2023 · 3 comments

karthikmurugadoss commented Feb 13, 2023

When I run python examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py, model.print_trainable_parameters() (line 219) prints the following message, as expected:

trainable params: 3932160 || all params: 7072948224 || trainable%: 0.055594355783029126

However, when I follow the instructions in the README and set up Accelerate with DeepSpeed CPU offloading, the same line outputs the following:

trainable params: 3932160 || all params: 3932160 || trainable%: 100.0

Digging a little deeper, it looks like model.named_parameters() returns tensors of size 0 (except for the LoRA A and B matrices) when running with Accelerate and DeepSpeed CPU offloading.

This is the original output of model.named_parameters(), showing only the first few parameters:

base_model.model.transformer.word_embeddings.weight     torch.Size([250880, 4096])
base_model.model.transformer.word_embeddings_layernorm.weight   torch.Size([4096])
base_model.model.transformer.word_embeddings_layernorm.bias     torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.bias   torch.Size([4096])
base_model.model.transformer.h.0.self_attention.query_key_value.weight  torch.Size([12288, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.bias    torch.Size([12288])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight   torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight   torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight    torch.Size([4096, 4096])
base_model.model.transformer.h.0.self_attention.dense.bias      torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.weight        torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.bias  torch.Size([4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight       torch.Size([16384, 4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([16384])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight       torch.Size([4096, 16384])

This is the output when running with Accelerate and DeepSpeed CPU offloading:

base_model.model.transformer.word_embeddings.weight     torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.weight   torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.bias     torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.bias   torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.weight  torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.bias    torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight   torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight   torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight    torch.Size([0])
base_model.model.transformer.h.0.self_attention.dense.bias      torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.weight        torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.bias  torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight       torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight       torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.bias torch.Size([0])

It doesn't look like this impacts fine-tuning, but I was curious why this is happening!

pacman100 (Contributor) commented

Hello @karthikmurugadoss, yes, model.print_trainable_parameters() is currently a bit confusing when using DeepSpeed ZeRO-3: the params are flattened and sharded across the GPUs, which is why you see shape-0 tensors. Please refer to https://huggingface.co/docs/transformers/main_classes/deepspeed#gathering-parameters for more information.
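
For reference, a minimal sketch of two ways to see the real numbers, assuming the standard DeepSpeed ZeRO-3 APIs (deepspeed.zero.GatheredParameters, and the ds_numel attribute that DeepSpeed attaches to partitioned params); model here stands in for the PEFT model from the example script:

import deepspeed

# Under ZeRO-3, each rank holds only a flattened shard; the local tensor is
# a shape-[0] placeholder. GatheredParameters temporarily all-gathers a
# param so its full shape is visible inside the context.
for name, param in model.named_parameters():
    with deepspeed.zero.GatheredParameters(param):
        print(name, param.shape)

# Alternatively, partitioned params carry their true element count in
# ds_numel, so totals can be computed without gathering anything:
total = sum(
    p.ds_numel if hasattr(p, "ds_numel") else p.numel()
    for p in model.parameters()
)
print(f"all params: {total}")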

karthikmurugadoss (Author) commented

Got it, thanks @pacman100!

dumpmemory (Contributor) commented

Same here, but the shape-0 tensor surfaces at

in_features, out_features = target.weight.shape

The shape is 0 with a GPT-2 model under DeepSpeed init.
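
A possible guard, sketched here under the assumption that DeepSpeed records the full shape of each partitioned param in a ds_shape attribute (GPT-2 uses Conv1D, whose weight is stored as (in_features, out_features)):

# The local weight is empty under ZeRO-3, but ds_shape still holds the
# full dimensions; the hasattr guard keeps this working without DeepSpeed.
weight = target.weight
full_shape = weight.ds_shape if hasattr(weight, "ds_shape") else weight.shape
in_features, out_features = full_shape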
