
accelerate DDP integrate #23151

Merged: 11 commits into main from smangrul/accelerate-ddp-integrate on May 31, 2023

Conversation

pacman100 (Contributor)

What does this PR do?

  1. Move DDP preparation to Accelerate (see the sketch after the launch commands below).
  2. This PR should be merged after accelerate mixed precision integrate #23148.
  3. No user-facing change. Users can now use `accelerate launch` for DDP and mixed precision, e.g.:

accelerate launch --num_processes 2 --multi_gpu --mixed_precision "bf16" run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir

The previous way of launching with torchrun still works as usual:

torchrun --nnodes 1 --nproc-per-node 2 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --bf16
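For context, here is a minimal sketch of what "DDP preparation via Accelerate" amounts to, stripped of everything the Trainer does around it. This is not the Trainer code itself; the model and optimizer below are placeholders, and only the `Accelerator` / `prepare` calls reflect the actual Accelerate API:

```python
import torch
from accelerate import Accelerator

# Picks up world size, device placement and mixed precision settings from the
# `accelerate launch` (or torchrun) environment.
accelerator = Accelerator()

# Placeholder model/optimizer; in the Trainer these come from the user's script.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Under a multi-GPU launch, `prepare` moves the model to the right device and
# wraps it in torch.nn.parallel.DistributedDataParallel; on a single process it
# only handles device placement.
model, optimizer = accelerator.prepare(model, optimizer)
```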

Empirical nuances that I noticed:

  1. Because DDP now goes through Accelerate, the LR scheduler is stepped num_processes times per optimization step, whereas previously it was stepped only once. As a result, the learning rate decays much faster with the Accelerate integration. In the example above, I had to increase the LR from 2e-5 to 5e-5 to account for this behaviour and maintain performance; see the illustration below.
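A rough, self-contained illustration of this nuance (a sketch under assumed values, not the Trainer code): with a linear schedule sized for an assumed `max_steps` optimizer steps and 2 processes, stepping the scheduler `num_processes` times per optimizer step burns through the decay budget that much faster, which is why the peak learning rate had to be raised above.

```python
import torch

num_processes = 2   # assumed: matches --num_processes in the launch command
max_steps = 1_000   # assumed total schedule length

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Linear decay to zero over `max_steps` scheduler steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / max_steps)
)

for _ in range(10):  # pretend optimizer steps
    optimizer.step()
    # With the Accelerate integration the schedule is effectively advanced
    # once per process for every optimizer step, so it decays roughly
    # num_processes times faster than a single scheduler.step() per step.
    for _ in range(num_processes):
        scheduler.step()

print(scheduler.get_last_lr())  # lower than a once-per-step schedule would give
```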

HuggingFaceDocBuilderDev commented May 4, 2023

The documentation is not available anymore as the PR was closed or merged.

@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-mp-integrate May 4, 2023 13:23
@sgugger (Collaborator) left a comment:


Thanks for working on this, just one comment!

Review comment on src/transformers/trainer.py (outdated, resolved)
@muellerzr (Contributor) left a comment:


Great job!

@pacman100 pacman100 changed the base branch from smangrul/accelerate-mp-integrate to main May 10, 2023 04:18
@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-mp-integrate May 10, 2023 04:18
Base automatically changed from smangrul/accelerate-mp-integrate to main May 31, 2023 06:57
@pacman100 pacman100 merged commit 1cf148a into main May 31, 2023
@pacman100 pacman100 deleted the smangrul/accelerate-ddp-integrate branch May 31, 2023 08:12
@pacman100 pacman100 changed the title from Smangrul/accelerate ddp integrate to accelerate DDP integrate May 31, 2023
sheonhan pushed a commit to sheonhan/transformers that referenced this pull request Jun 1, 2023
* mixed precision support via accelerate

* fix issues

* fix for the sharded ddp case

* fix flax and tf failing tests

* refactor the place to create `Accelerator` object

* move ddp prep to accelerate

* fix 😅

* resolving comments
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023