

RFC: Easily Customizable Optimizer.minimize #234

Open · wants to merge 9 commits into base: master

Conversation

@omalleyt12 (Contributor) commented Apr 20, 2020
Status: In Revision
RFC #: 234
Author(s): omalleyt12@
Sponsor(s): apassos@, fchollet@, karmel@
Updated: 2020-04-20

Objective

Create an Optimizer API that gives Optimizer subclasses full control of gradient updates. The API should ensure that Optimizers can be used through a unified interface without leaking abstractions. Training loops should not need to know the internal details of how the Optimizer chooses to:

  • Scale losses and gradients

  • Aggregate gradients

  • Clip gradients

  • etc

We also need to maintain endpoints that give maximum flexibility to users who do want control over these steps.

This API will enable users to write training loops that interoperate with a wide range of Optimizers.

Specific use cases considered (a sketch of the resulting hook pipeline follows the list):

  • Gradient clipping

  • Mixed precision

  • Horovod
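
For orientation, the proposal decomposes minimize into a pipeline of hooks. The sketch below is assembled from the hook names discussed in this thread (transform_loss, get_gradients, transform_unaggregated_gradients, aggregate_gradients, transform_gradients, apply_updates); the exact signatures in the RFC may differ:

class Optimizer:
  """Sketch of the proposed minimize() pipeline; names from this thread."""

  def minimize(self, loss, variables, tape):
    loss = self.transform_loss(loss)                            # e.g. loss scaling
    grads_and_vars = self.get_gradients(loss, variables, tape)
    grads_and_vars = self.transform_unaggregated_gradients(grads_and_vars)
    grads_and_vars = self.aggregate_gradients(grads_and_vars)   # e.g. all-reduce
    grads_and_vars = self.transform_gradients(grads_and_vars)   # e.g. clipping
    self.apply_updates(grads_and_vars)

  # Default hooks are no-ops; subclasses (or constructor args) customize them.
  def transform_loss(self, loss):
    return loss

  def get_gradients(self, loss, variables, tape):
    return list(zip(tape.gradient(loss, variables), variables))

  def transform_unaggregated_gradients(self, grads_and_vars):
    return grads_and_vars

  def aggregate_gradients(self, grads_and_vars):
    return grads_and_vars  # a cross-replica sum in a real distributed setting

  def transform_gradients(self, grads_and_vars):
    return grads_and_vars

  def apply_updates(self, grads_and_vars):
    raise NotImplementedError  # concrete optimizers apply the variable updates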

@gabrieldemarmiesse (Member) commented Apr 20, 2020

In TensorFlow Addons, we have multiple wrappers/hooks that need to work with arbitrary optimizers, so we need to make sure the API covers those use cases.

Thank you in advance. After this RFC is merged and implemented, we should migrate those 4 wrappers to the new API.

Link to the nicely formatted document: https://github.com/omalleyt12/community/blob/patch-1/rfcs/2020-04-20-Optimizer-minimize.md

Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper?

If possible, could you all write some pseudo code of what each wrapper implementation would look like with this new API? That would allow us to answer this last question.

@ematejska (Contributor)

Please send out an email to tensorflow@developers.org with the announcement of this pull request and give at least a two week review period per: https://github.com/tensorflow/community/blob/master/governance/TF-RFCs.md

@ematejska added the RFC: Proposed label Apr 20, 2020
@PhilJd commented Apr 21, 2020

Thanks for the ping @gabrieldemarmiesse!
Unfortunately, I've never used Keras, so I'm missing a lot of context and I might overlook some solution to this - I'd be super happy about some comments on the proposals below :)

Our API use case:
In general, the idea of decoupled weight decay is to decouple the decaying variable update from the optimizer's variable update. While weight decay is equivalent to an L2 penalty on the weights for some optimizers, this is not true for adaptive optimizers such as Adam. So the more general solution would be something like (pseudo code):

optimizer.apply_gradients(grads_and_vars)
for var in vars_to_decay:
    var.assign_sub(weight_decay * var)

The challenge here is that the set of variables to be decayed might not be the same as the set of trainable variables (e.g., we might want to exclude batch norm parameters). For maximum flexibility, we've been wrapping the optimizer's minimize method and allowing a list of the variables to decay to be passed, as minimize(loss, var_list, ..., decay_var_list=None).

Possible solutions:

  • Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.
  • Create a second optimizer that only decays variables and chain the optimizers (I don't know if chaining optimizers in Keras is possible).
  • Always decay all variables. To exclude some variables from decay, one would need to create separate optimizers
    for those. Again, I don't know if using two or more different optimizers is possible in Keras (especially with model.fit).

@omalleyt12 (Contributor, Author)

Thanks for looping these people in @gabrieldemarmiesse!

@PhilJd Thanks for the comments!

Re weight decay: In general you can still wrap Optimizer.minimize to perform this:

class WeightDecayOptimizer(Optimizer):

  def minimize(self, loss, variables, tape, decay_variables=None):
    if decay_variables is None:
      decay_variables = variables
    for variable in decay_variables:
      variable.assign_sub(self.weight_decay * variable)
    super(WeightDecayOptimizer, self).minimize(loss, variables, tape)

The decay_variables argument would not be passed in the default Model.fit, so only the use case where the variables and decay_variables are the same would be supported. This is the case for Model.fit today though, so that wouldn't be a regression. Users could also override Model.train_step (which allows users to provide their own custom training step logic) to support this use case if they want.
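
A sketch of such a train_step override, assuming the WeightDecayOptimizer above and a hypothetical decay_variables attribute on the model (compiled_loss and compiled_metrics are the Keras APIs of that era):

import tensorflow as tf

class DecayingModel(tf.keras.Model):

  def train_step(self, data):
    x, y = data
    with tf.GradientTape() as tape:
      y_pred = self(x, training=True)
      loss = self.compiled_loss(y, y_pred)
    # Forward the extra argument that WeightDecayOptimizer accepts.
    self.optimizer.minimize(loss, self.trainable_variables, tape,
                            decay_variables=self.decay_variables)
    self.compiled_metrics.update_state(y, y_pred)
    return {m.name: m.result() for m in self.metrics}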

> Make the decay_var_list a member variable of the optimizer. This might be annoying if the variables to minimize differ between train steps.

This would work, although it's worth noting as well that oftentimes, Variables are created when the Model is first called on data, so for most workflows the Model Variables aren't guaranteed to exist when the Optimizer is created.

> Create a second optimizer that only decays variables and chain the optimizers (I don't know if chaining optimizers in Keras is possible)

This wouldn't work with the default Model.fit logic, whereas the decay_variables argument pattern at least works as long as the decay_variables and variables are the same.

In general, there's probably a question of whether Model.fit should have a separate configuration for how to decay variables, since it seems orthogonal to the Optimizer (in that, IIUC, it never needs to know about the gradients).

@fchollet (Member) left a comment

Two comments:

  • Some steps are configurable via Optimizer constructor arguments. Why not all? We should be consistent.
  • The API expects users to override private methods. We should avoid this. Any method that is meant to be overridable should be public.

Recommendation:

  • Keep the methods transform_loss, transform_unaggregated_gradients, transform_gradients private, but make them all configurable via constructor arguments. Since they can be configured in the constructor, they should not be overridden by users (users can override the constructor).
  • Make apply_updates & get_gradients public (they're not meant to be configurable via a constructor argument), since they are meant to be overridden.
  • Make method signatures consistent by having get_gradients return grads_and_vars instead of just gradients (since all subsequent methods expect grads_and_vars)
  • Consider renaming transform_unaggregated_gradients and transform_gradients to highlight the difference between the two (e.g. in a non-distributed setting, is it transform_unaggregated_gradients that is applied to the gradients, or transform_gradients, or both?)
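
A sketch of the shape these recommendations imply (illustrative only, not the RFC's code):

class Optimizer:

  def __init__(self, transform_gradients=None):
    # Private step: configured through the constructor rather than overridden.
    self._transform_gradients_fn = transform_gradients or (lambda gv: gv)

  def _transform_gradients(self, grads_and_vars):
    return self._transform_gradients_fn(grads_and_vars)

  # Public step: meant to be overridden; returns grads_and_vars (point 3 above).
  def get_gradients(self, loss, variables, tape):
    return list(zip(tape.gradient(loss, variables), variables))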

    grads_and_vars = self._transform_gradients(grads_and_vars)
    return grads_and_vars

  def apply_gradients(self, grads_and_vars, aggregate=True):
Member:

The difference between apply_updates and apply_gradients may be confusing. We should consider renaming apply_updates to highlight the difference (since we can't rename apply_gradients).

    self.transform_gradients_fns = transform_gradients_fns

  def _transform_loss(self, loss):
    # Can be overridden in subclasses

Can subclassers rely on underscored method names? If these are backwards-compatible, we should make that clear by dropping the underscore. If they are not, we should not encourage subclassers to rely on them.

Contributor:

I think there is an interesting naming question around this that pops up elsewhere too: how should we name methods intended for overriding that users shouldn't call directly? We run into this with e.g. train_step / make_train_function overriding too.

Contributor:

That's why I prefer that these things are constructor arguments instead of methods to be overridden.

@DEKHTIARJonathan commented May 4, 2020

@alextp how about we separate the TF/Keras logic from the "user" logic?
Some "pseudo-code" below:

# Abstract class: keep the framework-facing inner workings here.
class _WrappingOptimizer:

    def _pre_grad_aggregation(self, *args, **kwargs):
        # ... framework logic ...
        self.transform_loss(*args, **kwargs)
        # ... framework logic ...

    def _aggregate_grads(self, *args, **kwargs):
        pass

    ...

# Public class: the methods users are expected to override.
class WrappingOptimizer(_WrappingOptimizer):

    def transform_loss(self, *args, **kwargs):
        pass

    def aggregate_grads(self, *args, **kwargs):
        pass

Not sure if it is clear ;)
We could separate how Keras uses and calls the different methods from what we expect the user to override and modify.

That way we can keep a stable public API and change how we use the exposed API as often as required.

Going that way would probably be a little more readable. Passing functions around makes the code quite hard to read ...

Contributor (Author):

I changed the methods to public for now and added it as a discussion item for whether they should be private or public.

Re methods vs init arguments, my opinion is this: init arguments are great for the simple use cases we expect most users to have (e.g. gradient clipping, custom weight decay, aggregating by mean rather than by sum, etc). In these examples, each init argument is self-contained.

However, for more advanced use cases like LossScaling and differential privacy optimizers, the transformations needed at each step of the process (loss transform, unaggregated gradient transform, gradient aggregation, aggregated gradient transform) are tightly coupled. In these cases, having a subclass that contains all of this tightly coupled logic seems to make the most sense to me.

I think the current design, where the two most common transforms (aggregation and post-aggregation transformation) can be passed as __init__ arguments, and every discrete step of the minimize process has its own overridable method, achieves the best of both worlds: simple use cases don't require subclassing, and advanced users have maximum flexibility.
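
To make the contrast concrete, a sketch under the proposed API (the transform_gradients constructor argument is the RFC's proposal; the clip threshold and DP hyperparameters are illustrative, and the base class is the hook pipeline sketched earlier):

import tensorflow as tf

# Simple case: one self-contained __init__ argument.
def clip_fn(grads_and_vars):
  return [(tf.clip_by_norm(g, 1.0), v) for g, v in grads_and_vars]

optimizer = tf.keras.optimizers.Adam(1e-3, transform_gradients=[clip_fn])

# Advanced case: tightly coupled steps live together in one subclass.
class DifferentialPrivacyOptimizer(Optimizer):

  def __init__(self, l2_norm_clip=1.0, noise_stddev=0.1, **kwargs):
    super().__init__(**kwargs)
    self.l2_norm_clip = l2_norm_clip
    self.noise_stddev = noise_stddev

  def transform_unaggregated_gradients(self, grads_and_vars):
    # Clip each contribution before aggregation.
    return [(tf.clip_by_norm(g, self.l2_norm_clip), v)
            for g, v in grads_and_vars]

  def transform_gradients(self, grads_and_vars):
    # Add calibrated noise after aggregation.
    return [(g + tf.random.normal(tf.shape(g), stddev=self.noise_stddev), v)
            for g, v in grads_and_vars]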

Contributor (Author):

@reedwm re related comment

    # Can be overridden in subclasses
    return loss

  def _get_gradients(self, loss, variables, tape):

Can you add docstrings to this proposal? It's not obvious from reading through what the contract for each of these methods is. What can/must this return? What are each of these args? As a subclasser, those are important details to understand.

@ematejska changed the title from "Easily Customizable Optimizer.minimize RFC" to "RFC: Easily Customizable Optimizer.minimize" Apr 23, 2020
@reedwm (Member) commented Apr 24, 2020

/CC @DEKHTIARJonathan, @tgaddair. This API will be relevant to Horovod's distributed optimizer.

@DEKHTIARJonathan commented Apr 24, 2020

Thanks @reedwm, I appreciate being kept informed :)

I truly like the idea of this RFC. I have defended many points of this RFC's design in this PR: tensorflow/tensorflow#36398.

About:

> Should we create a utility class to help with wrapping an Optimizer? I.e. OptimizerWrapper? @omalleyt12

I would be in favor of this. Having OptimizerWrapper would greatly improve the maintainability of AMP and Horovod, and would give anyone thinking of plugging into TensorFlow "a standard" way to do it. On the core API side, it also simplifies support: you can state that the public API of the OptimizerWrapper class is the only one supported long term, and interface with it in the background however needed, fully abstracted from the Horovod or AMP Optimizers.

Overall, I'm really enthusiastic about this RFC. Sounds like a great improvement 👍
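
For illustration, a minimal OptimizerWrapper could simply delegate every hook to an inner optimizer, letting subclasses override only the steps they care about (a hypothetical sketch built on the hook names above; the actual class would need its own review, as asked below):

class OptimizerWrapper(Optimizer):
  """Delegates each minimize() step to an inner optimizer (sketch)."""

  def __init__(self, inner_optimizer, **kwargs):
    super().__init__(**kwargs)
    self._inner = inner_optimizer

  def transform_loss(self, loss):
    return self._inner.transform_loss(loss)

  def transform_unaggregated_gradients(self, grads_and_vars):
    return self._inner.transform_unaggregated_gradients(grads_and_vars)

  def aggregate_gradients(self, grads_and_vars):
    return self._inner.aggregate_gradients(grads_and_vars)

  def transform_gradients(self, grads_and_vars):
    return self._inner.transform_gradients(grads_and_vars)

  def apply_updates(self, grads_and_vars):
    return self._inner.apply_updates(grads_and_vars)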

def apply_gradients(self, grads_and_vars, aggregate=True):
  if aggregate:
    grads_and_vars = self._aggregate_gradients(grads_and_vars)
    grads_and_vars = self._transform_gradients(grads_and_vars)  # No-op by default
Contributor:

Again for this check: shouldn't this transformation happen regardless of aggregation?
Are we guarding it because of the dynamic where _aggregate_gradients must be called exactly once, to avoid double aggregation?

Contributor (Author):

Since transform_gradients is a method that applies only to aggregated gradients, in this case it can't be performed without aggregation.

@gabrieldemarmiesse (Member)

@hyang0129 could you take a look at this RFC to see if it would make tensorflow/addons#969 easier to implement? Maybe add some comments if you have ideas.

@Squadrick @CyberZHG friendly ping to see if this RFC can make the implementations of Stochastic Weight Averaging, the moving average wrapper, and the lookahead mechanism simpler in TensorFlow Addons.

@hyang0129

@omalleyt12 gradient transformations may have unintended consequences when paired with moment based optimizers such as Adam. These optimizers usually take the first and second moment of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.

@omalleyt12 (Contributor, Author)

> @omalleyt12 gradient transformations may have unintended consequences when paired with moment based optimizers such as Adam. These optimizers usually take the first and second moment of the gradient to compute adaptive learning rates for each parameter. Scaling a gradient by a factor X will not scale the parameter update by a factor of X, because the moments of the gradient do not scale linearly with the gradient.

@hyang0129 Agreed; the loss scaling optimizer example doesn't actually scale the gradients. It temporarily scales up the loss to compute gradients in a numerically stable way, and then unscales them before applying Variable updates.

For other, user-written gradient transformations such as gradient clipping, ensuring that the math does what they want is up to the user.
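
Mapped onto the proposed hooks, a loss-scaling wrapper could look roughly like this (a sketch assuming the OptimizerWrapper sketched earlier; a fixed scale stands in for the dynamic loss scaling real implementations use):

class LossScaleOptimizer(OptimizerWrapper):

  def __init__(self, inner_optimizer, loss_scale=2.0 ** 15, **kwargs):
    super().__init__(inner_optimizer, **kwargs)
    self.loss_scale = loss_scale

  def transform_loss(self, loss):
    # Scale the loss up so small float16 gradients don't flush to zero.
    return self._inner.transform_loss(loss) * self.loss_scale

  def transform_unaggregated_gradients(self, grads_and_vars):
    # Unscale before any further transforms or variable updates.
    grads_and_vars = [(g / self.loss_scale, v) for g, v in grads_and_vars]
    return self._inner.transform_unaggregated_gradients(grads_and_vars)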

@Squadrick (Member)

@gabrieldemarmiesse Sorry about the delay. Moving average_wrapper would be fairly easy: we only need to override transform_unaggregated_gradients.

Once average_wrapper is migrated, migrating moving_average and stochastic_weight_average optimizers would be very simple.

@@ -276,6 +284,9 @@ Or, if backwards compatibility is not needed, simply:
optimizer = tf.keras.optimizers.Adam(1e-3, aggregate_gradients=horovod.aggregate)
```

## `OptimizerWrapper`

With this proposal, we should also release an `OptimizerWrapper` class. This class will make it easier for developers to create subclasses that wrap an `Optimizer` while providing additional functionality, such as mixed-precision, Horovod, or differential privacy use cases.

Will the wrapper go through separate review? What APIs will it provide?

    if aggregate_gradients is None:
      aggregate_gradients = all_reduce_sum
    self.aggregate_gradients_fn = aggregate_gradients
    self.transform_gradients_fns = transform_gradients

It's unexpected that aggregate_gradients is a single fn while transform_gradients is a list of fns. The naming with fn and fns at least makes this clearer, though one wonders if it's also worth making the two args consistent: e.g., allowing list-wise composition for aggregate_gradients, or, better yet, requiring a single fn for transform and expecting users to compose it themselves. That is:

# Current input:
transform_gradients_fns = [fn_1, fn_2]

# Composition on the user side
def transform_gradients_composed(grads_and_vars):
  return fn_2(fn_1(grads_and_vars))

Optimizer(transform_gradients=transform_gradients_composed)

Would that work, or is there some reason that's less preferable?

## Questions and Discussion Topics

(1) Should these new hooks be private or public methods?
(2) Should we create an initializer argument for each hook, or only for the ones we expect most users to need (`aggregate_gradients` and `transform_gradients`)?

2.1. Or neither constructor arguments nor overridable methods?
3. What are the contracts for the constructor-arg functions? Single fns, lists, or something more formal that we can better error/type-check?
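
One way to pin down those contracts, sketched with typing aliases (an assumption for discussion, not the RFC's decision):

from typing import Callable, List, Tuple

import tensorflow as tf

GradsAndVars = List[Tuple[tf.Tensor, tf.Variable]]

# Every hook maps grads_and_vars to grads_and_vars, so hooks compose freely:
TransformFn = Callable[[GradsAndVars], GradsAndVars]
AggregateFn = Callable[[GradsAndVars], GradsAndVars]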

@ematejska (Contributor)

@omalleyt12 Please post the notes from the design review here.

@ematejska (Contributor)

@omalleyt12 A gentle ping so we can finish this up.

@ematejska (Contributor)

@omalleyt12 told me that this design will likely need another round of review at some point as there were differing opinions.

@ematejska removed the RFC: Proposed label Jul 7, 2020
@adriendoerig

Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!

@byronyi (Contributor) commented Aug 19, 2020

> Hello, do you have any information about when optimizers will be allowed to perform global gradient clipping across several GPUs? Some of my work quite critically depends on this. Thank you!

tensorflow/tensorflow@5357606

@DEKHTIARJonathan commented Aug 19, 2020

@adriendoerig if you use Horovod, that is already quite easy to do :)
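
For reference, under the proposed API a transform_gradients hook runs on aggregated gradients, so global-norm clipping would apply to the true cross-replica gradient. A sketch (the constructor argument is the RFC's proposal, not a released API):

import tensorflow as tf

def clip_by_global_norm_fn(grads_and_vars, clip_norm=1.0):
  grads, variables = zip(*grads_and_vars)
  clipped, _ = tf.clip_by_global_norm(list(grads), clip_norm)
  return list(zip(clipped, variables))

# transform_gradients hooks run after aggregation (see discussion above).
optimizer = tf.keras.optimizers.Adam(
    1e-3, transform_gradients=[clip_by_global_norm_fn])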


This section discusses the experience of supporting mixed-precision and Horovod in Keras’s built-in training logic (hereafter called Model.fit).

Keras now allows users to write custom training logic for their `Model`s via overriding `Model.train_step`: [code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/engine/training.py#L538). The default implementation of this method is 8 lines long, and fully supports all types of `Model`s, `loss`es, `metric`s, etc. that Keras supports. It attempts to serve as a reference that users can copy/paste to start writing their own training logic.

I think the two line references here should probably not reference the master branch: they no longer point to the code you want them to (as far as I can tell).
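
For readers following along without the links, the default train_step looked roughly like this at the time (simplified; the real version also unpacks and applies sample weights):

import tensorflow as tf

class MyModel(tf.keras.Model):

  def train_step(self, data):
    x, y = data
    with tf.GradientTape() as tape:
      y_pred = self(x, training=True)
      loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
    gradients = tape.gradient(loss, self.trainable_variables)
    self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
    self.compiled_metrics.update_state(y, y_pred)
    return {m.name: m.result() for m in self.metrics}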

@DepenM commented Dec 28, 2021

Just wondering: why is clip-by-global-norm not easily supported under a distribution strategy either? It would be really helpful, as it keeps the gradient direction unchanged.

@ematejska (Contributor)

@omalleyt12, @fchollet Are you still pursuing this or should this be closed out?
