Remove .data usages in optimizations.py #23417

Merged
1 commit merged into huggingface:main on May 19, 2023

Conversation

alanwaketan
Contributor

What does this PR do?

.data usage is deprecated in recent releases of PyTorch. See pytorch/pytorch#91093 (comment)

This change replaces all .data usages in optimizations.py with modern alternatives.
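
For illustration, the kind of change involved looks roughly like this (a minimal sketch, not the actual diff from this PR; `p` stands for any optimizer parameter):

```python
import torch

p = torch.nn.Parameter(torch.randn(3))

# Old, deprecated pattern: mutate the parameter through .data,
# which bypasses autograd's version tracking.
p.data.mul_(0.9)

# Modern alternative: mutate the parameter in-place under no_grad(),
# which keeps the same tensor but hides the update from autograd.
with torch.no_grad():
    p.mul_(0.9)
```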

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@connor-henderson @stas00

@JackCaoG

@muellerzr the usage of .data will greatly slow down pytorch/xla on nightly; we were hoping we could fix this issue before the next release.

@HuggingFaceDocBuilderDev
HuggingFaceDocBuilderDev commented May 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@stas00
Contributor
stas00 commented May 17, 2023

This is a very old and deprecated implementation, since it doesn't even follow the AdamW algorithm exactly. One should use torch.optim.AdamW instead, which also has a fused version since pt-2.0.0 that is almost as fast as apex's fused AdamW. So really you shouldn't be using this version anyway.

The only reason it was kept is for BC for those who rely on exact results remaining exact after new transformers versions are released, otherwise we would have just replaced it with torch.optim.AdamW in the first place.

p.s. no objections though to making it better...
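
For reference, the torch.optim.AdamW replacement mentioned above is used roughly like this (a sketch; `fused=True` requires PyTorch 2.0+ and CUDA parameters):

```python
import torch

model = torch.nn.Linear(10, 10).cuda()

# The fused variant runs the whole parameter update in a single CUDA kernel,
# which is what makes it nearly as fast as apex's fused AdamW.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01, fused=True)
```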

@alanwaketan
Contributor Author

@stas00 Thanks for the reply. How about Adafactor then?

@stas00
Contributor
stas00 commented May 17, 2023

oh, sorry, I didn't see it was Adafactor too. It's hard to see from the diff as it doesn't show the class names.

This Adafactor is definitely being used, but its implementation is super old as well. So it would certainly be a blessing to bring it up to a more modern code standard.
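
For context, the transformers Adafactor being discussed is typically constructed along these lines (an illustrative sketch; the hyperparameters are examples, not recommendations from this thread):

```python
import torch
from transformers import Adafactor

model = torch.nn.Linear(10, 10)

# External learning-rate setup: disable the relative-step/scale-parameter
# schedule and pass an explicit lr instead.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```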

@JackCaoG

@stas00 Do you mind giving this PR a review? Thanks.

Contributor
@stas00 stas00 left a comment


I think once .data is removed, the copy_ becomes somewhat difficult to understand in the Adafactor part of the diff. An explicit downcast at the very end would be more readable. But since p has to remain the same pointer, I couldn't quite think of a better way.
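
To illustrate the pattern being discussed (a rough sketch of the idea, not the exact code in this PR): the step keeps an fp32 working copy for half-precision parameters and then writes the result back into p in-place, so p keeps the same pointer and copy_ performs the implicit downcast.

```python
import torch

@torch.no_grad()
def apply_update(p: torch.Tensor, update: torch.Tensor) -> None:
    # Do the math in fp32 even when the parameter is fp16/bf16.
    p_fp32 = p.float() if p.dtype in (torch.float16, torch.bfloat16) else p
    p_fp32.add_(-update)
    # p must keep the same storage, so the result is copied back in-place;
    # copy_ implicitly downcasts to p's original dtype.
    if p_fp32 is not p:
        p.copy_(p_fp32)
```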

Otherwise looks good. Thank you for modernizing, @alanwaketan

Let me just invite @sgugger to have a quick look before we merge.

@stas00 stas00 requested a review from sgugger May 18, 2023 22:14
@sgugger
Collaborator
sgugger commented May 19, 2023

Thanks for your PR. Just to be sure though, is this all going to work with PyTorch 1.8+? 1.8 is the minimum version we officially support at the moment (for a couple more weeks at least, then 1.9 starting mid-June).

@stas00
Contributor
stas00 commented May 19, 2023

I'm almost 100% sure that is the case. The whole direct .data usage deprecation is a few years old at least.

Let me quickly test it with pt-1.8

@stas00
Contributor
stas00 commented May 19, 2023
$ pytest tests/optimization/test_optimization.py -k test_adafactor
========================================================== test session starts ===========================================================
platform linux -- Python 3.8.8, pytest-7.3.1, pluggy-0.13.1
rootdir: /mnt/nvme0/code/huggingface/transformers-master
configfile: setup.cfg
plugins: timeout-1.4.2, typeguard-2.12.1, flakefinder-1.0.0, forked-1.3.0, monitor-1.6.0, hypothesis-6.47.0, instafail-0.4.2, xdist-2.2.1
collected 3 items / 2 deselected / 1 selected

tests/optimization/test_optimization.py .                                                                                          [100%]

================================================================= PASSES =================================================================
======================================================== short test summary info =========================================================
PASSED tests/optimization/test_optimization.py::OptimizationTest::test_adafactor
============================================== 1 passed, 2 deselected, 4 warnings in 0.16s ===============================================

$ pt-ver
pt=1.8.2, cuda=11.1, nccl=2708

At least the Adafactor test that we have is passing.

@sgugger sgugger merged commit 8aa8513 into huggingface:main May 19, 2023
@alanwaketan
Contributor Author

Thanks @sgugger and @stas00 for reviewing the changes.

sheonhan pushed a commit to sheonhan/transformers that referenced this pull request Jun 1, 2023
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023