Implement efficient packing without cross-contamination attention #4224
base: main
Conversation
Should we consider using a `varlen_flash_atten` implementation?
```diff
@@ -33,6 +33,9 @@ def run_sft(
     dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
     model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
 
+    if data_args.efficient_packing:
+        configure_packing(model.config, model_args)
```
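For context, a minimal sketch of what `configure_packing` might look like, assuming it monkey-patches the model's `_get_unpad_data` helper in transformers; the constant name, the supported model list, and the error message are assumptions based on this thread, not the PR's exact code:

```python
# Hedged sketch, not the PR's exact implementation: swap the model's
# _get_unpad_data helper for a packing-aware one, so that flash_attention_2
# derives per-sample cu_seqlens from the packed attention mask.
import importlib

from transformers import PretrainedConfig

SUPPORTED_CLASS_FOR_EFFICIENT_PACKING = {"llama", "mistral"}  # assumed contents


def configure_packing(config: PretrainedConfig, model_args) -> None:
    model_type = getattr(config, "model_type", None)
    if model_type not in SUPPORTED_CLASS_FOR_EFFICIENT_PACKING:
        raise ValueError(f"Efficient packing is not supported for {model_type} models.")

    # e.g. transformers.models.llama.modeling_llama for model_type == "llama"
    module = importlib.import_module(f"transformers.models.{model_type}.modeling_{model_type}")
    # get_unpad_data is the packing-aware helper sketched later in this thread
    module._get_unpad_data = get_unpad_data
```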
Could we do `configure_packing` in `llamafactory.model.patcher`?
Sure, I just edited it
src/llamafactory/extras/constants.py (outdated)
```diff
@@ -66,6 +66,21 @@
 
 SUPPORTED_CLASS_FOR_S2ATTN = {"llama"}
 
+SUPPORTED_CLASS_FOR_MULTIPACK = [
```
is it "efficient_packing" rather than "multipack"?
Yes, I just fixed it.
Hi @AlongWY, the models in transformers already use `flash_attn_varlen_func` by default when an `attention_mask` is passed. I just made a slight change to the `attention_mask` when packing sequences, and returned the `indices`, `cu_seqlens`, and `max_seqlen_in_batch` corresponding to the modified `attention_mask`.
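To make the mechanism above concrete, here is a minimal sketch (not the PR's exact code) of how a packed attention mask can be turned into the varlen metadata that `flash_attn_varlen_func` consumes, assuming the convention where samples within a pack are labeled 1, 2, 3, ... and padding is 0:

```python
import torch
import torch.nn.functional as F


def get_unpad_data(attention_mask: torch.Tensor):
    """Sketch: derive (indices, cu_seqlens, max_seqlen_in_batch) from a packed mask.

    attention_mask: (batch, seq_len) int tensor, e.g. [[1, 1, 1, 2, 2, 0]] means
    two samples of lengths 3 and 2 packed into one row, plus one pad position.
    Assumes at least one non-pad token in the batch.
    """
    bsz = attention_mask.size(0)
    max_num = int(attention_mask.max())
    # offset the labels per row so that flatten() cannot merge samples across rows
    offsets = torch.arange(bsz, device=attention_mask.device).unsqueeze(1) * max_num
    labels = torch.where(attention_mask != 0, attention_mask + offsets, 0).flatten()

    indices = torch.nonzero(labels, as_tuple=False).flatten()  # non-pad positions
    # length of every packed sample, in order of appearance
    _, seqlens_in_batch = torch.unique_consecutive(labels[indices], return_counts=True)
    seqlens_in_batch = seqlens_in_batch.to(torch.int32)
    max_seqlen_in_batch = int(seqlens_in_batch.max())
    # cumulative sequence lengths prepended with 0, e.g. [0, 3, 5]
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0), (1, 0)).to(torch.int32)
    return indices, cu_seqlens, max_seqlen_in_batch
```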
What does this PR do?
Update 15/6/2024: Added packing support for the eager and sdpa attention implementations.
Fixes #2289
Implement efficient packing without cross-contamination attention
Taking inspiration from repositories such as axolotl and functionary, I implemented sequence packing more effectively, enabling the model to learn each sample without attending to other samples in the same pack. For now, this implementation only supports SFT with flash_attention_2.
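Regarding the eager and sdpa support mentioned in the 15/6/2024 update: one way to get the same isolation without flash-attn is to expand the packed 2D mask into a block-diagonal causal 4D additive mask. The sketch below illustrates the technique and is not necessarily the PR's code:

```python
import torch


def packed_4d_causal_mask(attention_mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Sketch: build a (batch, 1, seq, seq) additive mask for eager/sdpa attention.

    attention_mask labels packed samples 1, 2, 3, ... and uses 0 for padding,
    so tokens may only attend causally within their own sample.
    """
    bsz, seq_len = attention_mask.shape
    ids = attention_mask
    # allowed[b, i, j]: query i and key j belong to the same non-pad sample
    same_sample = (ids.unsqueeze(2) == ids.unsqueeze(1)) & (ids.unsqueeze(1) != 0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=ids.device))
    allowed = same_sample & causal
    # additive mask: 0 where attention is allowed, a large negative value elsewhere;
    # fully-padded query rows end up entirely masked, and the loss ignores them
    mask = torch.full((bsz, 1, seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype, device=ids.device)
    return mask.masked_fill(allowed.unsqueeze(1), 0.0)
```

Note that flash_attention_2 never materializes this (seq_len × seq_len) mask; it consumes the varlen metadata instead, which is why it is the preferred backend for long packed sequences.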
Example training config:
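The original config did not survive in this copy of the thread; below is a hypothetical sketch in LLaMA-Factory's YAML config style, assuming the `efficient_packing` flag added by this PR (all other keys and values are illustrative, and key names may differ between versions):

```yaml
# Hypothetical example, not the PR's actual config.
model_name_or_path: meta-llama/Meta-Llama-3-8B
stage: sft
do_train: true
finetuning_type: lora
dataset: alpaca_en
template: llama3
cutoff_len: 4096
efficient_packing: true    # the flag introduced by this PR
flash_attn: fa2            # flash_attention_2 is the primary supported backend
output_dir: saves/llama3-8b-sft-packed
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```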
Before submitting