Fix crash in CLR optimizer callback #2172
Conversation
I experienced the following stack trace when TF 2.4 nightly passed "step" as an int64. Perhaps that is new behavior?

File "...\Python38\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1117, in fit
  callbacks.on_epoch_end(epoch, epoch_logs)
File "...\Python38\lib\site-packages\tensorflow\python\keras\callbacks.py", line 427, in on_epoch_end
  callback.on_epoch_end(epoch, logs)
File "...\Python38\lib\site-packages\tensorflow\python\keras\callbacks.py", line 2274, in on_epoch_end
  self._log_epoch_metrics(epoch, logs)
File "...\Python38\lib\site-packages\tensorflow\python\keras\callbacks.py", line 2316, in _log_epoch_metrics
  train_logs = self._collect_learning_rate(train_logs)
File "...\Python38\lib\site-packages\tensorflow\python\keras\callbacks.py", line 2301, in _collect_learning_rate
  logs['learning_rate'] = lr_schedule(self.model.optimizer.iterations)
File "...\Python38\lib\site-packages\tensorflow_addons\optimizers\cyclical_learning_rate.py", line 94, in __call__
  cycle = tf.floor(1 + step / (2 * step_size))
File "...\Python38\lib\site-packages\tensorflow\python\ops\variables.py", line 1074, in _run_op
  return tensor_oper(a.value(), *args, **kwargs)
File "...\Python38\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1155, in binary_op_wrapper
  raise e
File "...\Python38\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1139, in binary_op_wrapper
  return func(x, y, name=name)
File "...\Python38\lib\site-packages\tensorflow\python\util\dispatch.py", line 201, in wrapper
  return target(*args, **kwargs)
File "...\Python38\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1311, in truediv
  return _truediv_python3(x, y, name)
File "...\Python38\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1241, in _truediv_python3
  raise TypeError("x and y must have the same dtype, got %r != %r" %
TypeError: x and y must have the same dtype, got tf.int64 != tf.float32
@@ -91,8 +91,9 @@ def __call__(self, step):
    dtype = initial_learning_rate.dtype
    maximal_learning_rate = tf.cast(self.maximal_learning_rate, dtype)
    step_size = tf.cast(self.step_size, dtype)
    cycle = tf.floor(1 + step / (2 * step_size))
    x = tf.abs(step / step_size - 2 * cycle + 1)
    step_as_dtype = tf.cast(step, dtype)
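For context, the triangular cyclical-learning-rate computation this hunk touches can be sketched framework-free. This is a plain-Python paraphrase of the formula, not the tfa implementation; the function name and default values are illustrative. The up-front cast of `step` to float is the analogue of `tf.cast(step, dtype)` in the patch:

```python
import math

def triangular_clr(step, initial_lr=1e-3, maximal_lr=1e-2, step_size=10):
    # Coerce step to float first -- the analogue of tf.cast(step, dtype) --
    # so an integer step counter cannot break the divisions below.
    step = float(step)
    step_size = float(step_size)
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)
    # scale_fn is fixed to 1 here (plain triangular policy).
    return initial_lr + (maximal_lr - initial_lr) * max(0.0, 1.0 - x)
```

With these defaults the rate starts at `initial_lr`, peaks at `maximal_lr` after `step_size` steps, and returns to the floor at `2 * step_size`.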
How about
step = tf.cast(step, dtype)
This should conform to the naming convention in this script.
That was also my instinct, but it causes a unit test failure. ("step" is accessed later in the method and is expected to still have the original dtype)
I could do this without a local variable, just using "tf.cast(step, dtype)" where needed. Would that be better?
@DavidWAbrahams Thanks for the clarification. Your approach is better.
Could you also share a minimal runnable code snippet to reproduce the original issue? Thank you!
Here's my minimal repro. It only triggers when I have added a TensorBoard callback. I guess that callback is somehow clobbering the dtype of "step".
For more detail, my versions are
tf-nightly-gpu 2.4.0.dev20200917
tfa-nightly 0.12.0.dev20200918223509
tb-nightly 2.4.0a20200921
Is this related to tensorflow/tensorflow#26407?
@bhack Thanks, yes that is probably what triggers my crash. But even if that issue is eventually fixed, I think it's best if the cyclical learning rate callback sanitizes its inputs.
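The "sanitizes its inputs" idea can be illustrated framework-free. This is a hypothetical sketch, not the tfa API: a toy schedule object that coerces whatever integer-like step a caller (such as a logging callback passing an optimizer's iteration counter) hands in, before doing float arithmetic:

```python
class SanitizingSchedule:
    """Toy schedule that accepts int or float steps alike (illustrative only)."""

    def __init__(self, step_size):
        self.step_size = float(step_size)

    def __call__(self, step):
        # Defensive cast at the boundary: callers may supply an integer
        # counter, so normalize it before mixing it into float math.
        step = float(step)
        return 1 + step / (2 * self.step_size)
```

The design point is simply that the cast happens once, at the entry point, so every downstream expression sees a consistent dtype.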
Thank you!
* Fix crash in CLR optimizer callback
* Attempt to fix unit test failure
I experienced the following stack trace when TF 2.4 nightly passes "step" as an int64. This happens, for example, when using the nightly version of TensorBoard.
Minimal repro: https://pastebin.com/Ni2dgDgh
Environment:
tf-nightly-gpu 2.4.0.dev20200917
tfa-nightly 0.12.0.dev20200918223509
tb-nightly 2.4.0a20200921
Brief Description of the PR:
Fixes # (issue)
How Has This Been Tested?
Run on
tf-nightly-gpu 2.4.0.dev20200917
and
tfa-nightly 0.12.0.dev20200918223509
My project code looks like:

# Assumes: import tensorflow as tf
# from tensorflow_addons.optimizers import CyclicalLearningRate, LAMB
clr = CyclicalLearningRate(initial_learning_rate=1e-3,
                           maximal_learning_rate=1e-2,
                           step_size=3 * STEPS_PER_EPOCH_TRAIN,
                           scale_fn=lambda x: 1.0)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
optimizer = LAMB(clr)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])