
The weights saved during training with DeepSpeed ZeRO stage 3 are incomplete #453

Closed
bestpredicts opened this issue May 16, 2023 · 7 comments

@bestpredicts

I used the train_lora.py code provided by FastChat and trained with DeepSpeed ZeRO stage 3 without CPU offload. However, the .bin weight file saved after training is only 3.6 MB and cannot be loaded. With non-stage-3 training, the saved weights are 16 MB and can be loaded for inference.
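
For reference, a quick way to confirm that a saved file only contains placeholders is to inspect the tensor shapes in the .bin file. This is a minimal diagnostic sketch; the file path is an assumption, so point it at whatever file your run actually produced:

import torch

# Hypothetical path; substitute the .bin your Trainer wrote.
state_dict = torch.load("output_dir/adapter_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
# Shapes like (0,) mean the parameters were never gathered before saving.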

deepspeed stage3.json

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
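
A note on the config above: "stage3_gather_16bit_weights_on_model_save": true only takes effect when the save path actually goes through the DeepSpeed engine. A minimal sketch, assuming engine is the DeepSpeedEngine that wraps the model (with the HF Trainer this is the wrapped model after training; the exact handle may differ across versions):

# With the gather flag enabled, the engine can write a consolidated 16-bit model.
engine.save_16bit_model("output_dir", "pytorch_model_16bit.bin")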

train_lora.py

# Usage: deepspeed train_lora.py --deepspeed <$PATH_TO_DEEPSPEED_CONFIG>

# Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
#    Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
#    Licensed under the Apache License, Version 2.0 (the "License");
#    you may not use this file except in compliance with the License.
#    You may obtain a copy of the License at
#
#        http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS,
#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#    See the License for the specific language governing permissions and
#    limitations under the License.

from dataclasses import dataclass, field
import logging
import pathlib
import typing

from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
from peft import LoraConfig, get_peft_model
import transformers
from transformers import Trainer

from fastchat.train.train import (
    DataArguments,
    ModelArguments,
    TrainingArguments,
    make_supervised_data_module,
)

from fastchat.train.llama_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

replace_llama_attn_with_flash_attn()


@dataclass
class LoraArguments:
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_target_modules: typing.List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj"]
    )
    lora_weight_path: str = ""
    bias: str = "none"


# Gather a ZeRO-3 partitioned parameter onto CPU so its full value can be saved.
def maybe_zero_3(param):
    if hasattr(param, "ds_id"):
        assert param.ds_status == ZeroParamStatus.NOT_AVAILABLE
        with zero.GatheredParameters([param]):
            param = param.data.cpu().clone().detach()
    return param


# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(state_dict, bias):
    if bias == "none":
        to_return = {
            k: state_dict[k].cpu().clone().detach() for k in state_dict if "lora_" in k
        }
    elif bias == "all":
        to_return = {
            k: state_dict[k] for k in state_dict if "lora_" in k or "bias" in k
        }
    elif bias == "lora_only":
        to_return = {}
        for k in state_dict:
            if "lora_" in k:
                to_return[k] = state_dict[k]
                bias_name = k.split("lora_")[0] + "bias"
                if bias_name in state_dict:
                    to_return[bias_name] = state_dict[bias_name]
    else:
        raise NotImplementedError
    to_return = {k: maybe_zero_3(v) for k, v in to_return.items()}
    return to_return


def train():
    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments, LoraArguments)
    )
    (
        model_args,
        data_args,
        training_args,
        lora_args,
    ) = parser.parse_args_into_dataclasses()

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )
    lora_config = LoraConfig(
        r=lora_args.lora_r,
        lora_alpha=lora_args.lora_alpha,
        target_modules=lora_args.lora_target_modules,
        lora_dropout=lora_args.lora_dropout,
        bias=lora_args.bias,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    if training_args.deepspeed is not None and training_args.local_rank == 0:
        model.print_trainable_parameters()

    if training_args.gradient_checkpointing:
        logging.warning(
            "gradient checkpointing with lora makes requires_grad "
            "incorrect and needs a monkey patch in Trainer or the "
            "wrapped model's forward. ref: "
            "https://github.com/lm-sys/FastChat/pull/138#issuecomment-1509172198"
        )

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
        use_fast=False,
    )
    tokenizer.pad_token = tokenizer.unk_token

    data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
    trainer = Trainer(
        model=model, tokenizer=tokenizer, args=training_args, **data_module
    )

    model.config.use_cache = False

    if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
        trainer.train(resume_from_checkpoint=True)
    else:
        trainer.train()
    trainer.save_state()

    # Save states. Weights might be a placeholder in zero3 and need a gather
    state_dict = get_peft_state_maybe_zero_3(model.state_dict(), lora_args.bias)
    if training_args.local_rank == 0:
        model.save_pretrained(training_args.output_dir, state_dict=state_dict)


if __name__ == "__main__":
    train()
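
A likely reason the saved file is so small: under ZeRO-3 every parameter is partitioned across ranks, so the tensors seen locally (including those returned by model.state_dict()) are zero-sized placeholders until they are explicitly gathered. Below is a hedged sketch of gathering only the LoRA weights with zero.GatheredParameters over named_parameters(), reusing the names from the script above; it illustrates the gather, it is not the upstream fix:

# Sketch only: gather each LoRA parameter explicitly before saving.
# named_parameters() keeps the DeepSpeed attributes (ds_id / ds_status) on the
# tensors, so the context manager below can materialize the full weight locally.
lora_state_dict = {}
for name, param in model.named_parameters():
    if "lora_" not in name:
        continue
    with zero.GatheredParameters([param]):
        lora_state_dict[name] = param.data.detach().cpu().clone()

if training_args.local_rank == 0:
    model.save_pretrained(training_args.output_dir, state_dict=lora_state_dict)
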
@tengxiaoliu

Same here: when using stage 3 without offload, it only saves a null checkpoint (128 KB). When I switch to ZeRO stage 2, the saved model is complete (77 MB).
By the way, I'm using the callback mentioned in #96 to save the model.
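
For readers who have not seen #96: the callback approach is typically a small TrainerCallback that writes only the PEFT adapter on every save event. The sketch below is hypothetical (the class name is an assumption; the actual callback in #96 may differ):

import os

from transformers import TrainerCallback


class SavePeftModelCallback(TrainerCallback):
    # Hypothetical sketch: save only the adapter whenever the Trainer checkpoints.
    def on_save(self, args, state, control, **kwargs):
        checkpoint_dir = os.path.join(
            args.output_dir, f"checkpoint-{state.global_step}"
        )
        kwargs["model"].save_pretrained(checkpoint_dir)
        return control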

@tengxiaoliu
tengxiaoliu commented May 30, 2023

Instead of saving the adapter checkpoint with the callback, I managed to save the full adapter weights by overriding the _save_checkpoint method of the trainer (#96 (comment)):

import os

import numpy as np
import torch
from peft import get_peft_model_state_dict
from transformers import Seq2SeqTrainer
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

class MyTrainer(Seq2SeqTrainer):

    def _save_checkpoint(self, _, trial, metrics=None):
        """ Don't save base model, optimizer etc.
            but create checkpoint folder (needed for saving adapter) """
        checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

        run_dir = self._get_output_dir(trial=trial)
        output_dir = os.path.join(run_dir, checkpoint_folder)

        if metrics is not None and self.args.metric_for_best_model is not None:
            metric_to_check = self.args.metric_for_best_model
            if not metric_to_check.startswith("eval_"):
                metric_to_check = f"eval_{metric_to_check}"
            metric_value = metrics[metric_to_check]

            operator = np.greater if self.args.greater_is_better else np.less
            if (self.state.best_metric is None or self.state.best_model_checkpoint is None
                    or operator(metric_value, self.state.best_metric)):
                self.state.best_metric = metric_value

                self.state.best_model_checkpoint = output_dir

        os.makedirs(output_dir, exist_ok=True)

        if self.args.should_save:
            self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)

        # save adapter config
        self.model.peft_config.save_pretrained(output_dir)
        # get state dict through deepspeed engine
        engine_state_dict = self.model_wrapped._zero3_consolidated_16bit_state_dict()
        lora_state_dict = get_peft_model_state_dict(self.model, engine_state_dict)
        if self.args.local_rank == 0:
            torch.save(lora_state_dict, os.path.join(output_dir, "adapter_model.bin"))
            print(f"Save adapter model at {output_dir}")

This worked in my case; the trainer can save the full adapter model under ZeRO-3.
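
If it helps, using this in a script like train_lora.py above is just a matter of constructing the custom trainer instead of the stock one (a sketch, assuming MyTrainer subclasses the same Trainer class you already use and accepts the same arguments):

trainer = MyTrainer(
    model=model, tokenizer=tokenizer, args=training_args, **data_module
)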

@Nomiizz
Nomiizz commented Jun 1, 2023

@tengxiaoliu Can these adapter model weights be later merged with the following code:

lora_model = PeftModel.from_pretrained(
    base_model,
    output_dir,
    torch_dtype=torch.float16,
)

model = lora_model.merge_and_unload()

or should I use the lora_model directly for inference?

@tengxiaoliu

@Nomiizz Yes, the saved adapter should be merged with the base model for inference.
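
For completeness, here is a minimal merge-and-save sketch using the standard PEFT API (all paths are placeholders; substitute your own base model and adapter output directory):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: point these at your base model and the directory containing
# adapter_model.bin / adapter_config.json.
base_model = AutoModelForCausalLM.from_pretrained(
    "path/to/base-model", torch_dtype=torch.float16
)
lora_model = PeftModel.from_pretrained(base_model, "path/to/adapter-output-dir")

# Fold the LoRA weights into the base weights and save a standalone model.
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/base-model").save_pretrained("path/to/merged-model")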

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

The github-actions bot closed this as completed on Jul 5, 2023.
@LittleYouEr

Not working in my case. I also use ZeRO-3 + LoRA, but the adapter_model.bin is only 3.2 MB, and when it is used at inference time, loading fails with a size mismatch:

copying a param with shape torch.Size([0]) from checkpoint

Any suggestions?

@LittleYouEr

Solved with the following code:

# get state dict through deepspeed engine
engine_state_dict = self.model_wrapped._zero3_consolidated_16bit_state_dict()
lora_state_dict = get_peft_model_state_dict(self.model, engine_state_dict)
if self.args.local_rank == 0:
    torch.save(lora_state_dict, os.path.join(output_dir, "adapter_model.bin"))
    print(f"Save adapter model at {output_dir}")
