
accelerate DDP integrate #23151

Merged: 11 commits into main from smangrul/accelerate-ddp-integrate on May 31, 2023

Conversation

pacman100 (Contributor)

What does this PR do?

  1. Move DDP preparation to Accelerate (see the sketch after the launch commands below).
  2. This PR should be merged after accelerate mixed precision integrate #23148.
  3. No user-facing change. Users can now use `accelerate launch` for DDP and mixed precision, e.g.:

accelerate launch --num_processes 2 --multi_gpu --mixed_precision "bf16" run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir

The previous way of launching with torchrun still works as usual:

torchrun --nnodes 1 --nproc-per-node 2 run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 5e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir \
  --bf16
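For context, here is a minimal sketch of what "DDP preparation via Accelerate" amounts to, stripped of everything the Trainer does around it. This is not the Trainer code itself; the model and optimizer below are placeholders, and only the `Accelerator` / `prepare` calls reflect the actual Accelerate API:

```python
import torch
from accelerate import Accelerator

# Picks up world size, device placement and mixed precision settings from the
# `accelerate launch` (or torchrun) environment.
accelerator = Accelerator()

# Placeholder model/optimizer; in the Trainer these come from the user's script.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Under a multi-GPU launch, `prepare` moves the model to the right device and
# wraps it in torch.nn.parallel.DistributedDataParallel; on a single process it
# only handles device placement.
model, optimizer = accelerator.prepare(model, optimizer)
```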

Empirical nuances that I noticed:

  1. Because DDP now goes through Accelerate, the LR scheduler is stepped num_processes times per optimization step, whereas previously it was stepped only once. As a result, the learning rate decays much faster with the Accelerate integration. In the example above, I had to increase the LR from 2e-5 to 5e-5 to account for this behaviour and maintain performance; see the illustration below.
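A rough, self-contained illustration of this nuance (a sketch under assumed values, not the Trainer code): with a linear schedule sized for an assumed `max_steps` optimizer steps and 2 processes, stepping the scheduler `num_processes` times per optimizer step burns through the decay budget that much faster, which is why the peak learning rate had to be raised above.

```python
import torch

num_processes = 2   # assumed: matches --num_processes in the launch command
max_steps = 1_000   # assumed total schedule length

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Linear decay to zero over `max_steps` scheduler steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / max_steps)
)

for _ in range(10):  # pretend optimizer steps
    optimizer.step()
    # With the Accelerate integration the schedule is effectively advanced
    # once per process for every optimizer step, so it decays roughly
    # num_processes times faster than a single scheduler.step() per step.
    for _ in range(num_processes):
        scheduler.step()

print(scheduler.get_last_lr())  # lower than a once-per-step schedule would give
```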

HuggingFaceDocBuilderDev commented May 4, 2023

The documentation is not available anymore as the PR was closed or merged.

@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-mp-integrate May 4, 2023 13:23
@sgugger (Collaborator) left a comment:


Thanks for working on this, just one comment!

Review comment on src/transformers/trainer.py (outdated, resolved)
@muellerzr (Contributor) left a comment:


Great job!

@pacman100 pacman100 changed the base branch from smangrul/accelerate-mp-integrate to main May 10, 2023 04:18
@pacman100 pacman100 changed the base branch from main to smangrul/accelerate-mp-integrate May 10, 2023 04:18
Base automatically changed from smangrul/accelerate-mp-integrate to main May 31, 2023 06:57
@pacman100 pacman100 merged commit 1cf148a into main May 31, 2023
@pacman100 pacman100 deleted the smangrul/accelerate-ddp-integrate branch May 31, 2023 08:12
@pacman100 pacman100 changed the title from Smangrul/accelerate ddp integrate to accelerate DDP integrate May 31, 2023
sheonhan pushed a commit to sheonhan/transformers that referenced this pull request Jun 1, 2023
* mixed precision support via accelerate

* fix issues

* fix for the sharded ddp case

* fix flax and tf failing tests

* refactor the place to create `Accelerator` object

* move ddp prep to accelerate

* fix 😅

* resolving comments
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023