Suggestion for cost=nan issue while training #124

Closed · david-bernstein opened this issue Dec 12, 2018 · 12 comments
Labels: help wanted (Extra attention is needed)

Comments

david-bernstein commented Dec 12, 2018

I agree with a previous user that one source of the cost=nan error is the instability of the gradient of the L2 norm when it operates on small or zero tensors. In my experience the error appears randomly during training, but only about every 30k–40k steps. To prevent it, I changed the norm in the discriminative loss to the L1 norm, which does not have this stability issue. I have not gotten a nan during training since (several million steps over a number of training runs).

This of course changes the loss, but training seems to be just as effective as with the L2 norm.
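
Here is a minimal sketch of what I mean (assuming TF 1.x graph mode; the snippet is illustrative and not code from this repo). At an all-zero vector the gradient of the default Euclidean tf.norm comes back as nan, while ord=1 stays finite:

import tensorflow as tf

x = tf.constant([0.0, 0.0, 0.0])

l2 = tf.norm(x)         # Euclidean norm: sqrt(sum(x^2))
l1 = tf.norm(x, ord=1)  # L1 norm: sum(|x|)

# d(sqrt(sum(x^2)))/dx = x / ||x||, which is 0/0 -> nan at x = 0
grad_l2 = tf.gradients(l2, x)[0]
# d(sum(|x|))/dx uses sign(x); TF defines the gradient of abs at 0 as 0
grad_l1 = tf.gradients(l1, x)[0]

with tf.Session() as sess:
    print(sess.run(grad_l2))  # [nan nan nan]
    print(sess.run(grad_l1))  # [0. 0. 0.]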

MaybeShewill-CV (Owner) commented

@david-bernstein Could you please give details of your experiment environment, such as the TF version, etc.? :)

MaybeShewill-CV added the help wanted (Extra attention is needed) label on Dec 13, 2018
david-bernstein (Author) commented

Sure, the TF version varies, but it is usually 1.10.0 or 1.11.0. The code is running on Ubuntu 16.04, and I've run it on machines with 1070 and 1080 Ti GPUs.

This issue with the L2 norm gradient being unstable is a known TensorFlow problem; see tensorflow/tensorflow#12071.

I tried to implement the fix described in that issue, but I am not a TF expert and haven't gotten it to work yet.
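
As I understand it, the workaround discussed there is roughly an epsilon-stabilized norm; here is a rough sketch of the idea (the function name and the epsilon value are just illustrative, and this is not the exact code from that issue):

import tensorflow as tf

def safe_l2_norm(x, axis=None, epsilon=1e-12):
    # A small epsilon inside the sqrt keeps the gradient finite when x is
    # all zeros, avoiding the 0/0 -> nan case hit by the default tf.norm.
    return tf.sqrt(tf.reduce_sum(tf.square(x), axis=axis) + epsilon)

# e.g. instead of tf.norm(mu_diff_bool, axis=1):
# mu_norm = safe_l2_norm(mu_diff_bool, axis=1)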

MaybeShewill-CV (Owner) commented

@david-bernstein Thanks for kindly sharing this with us. I will test it later :)

fayechou commented


@MaybeShewill-CV Have you tried this solution? How are the results? Thanks very much!

yxxxxxxxx commented

I also met this problem, and I changed all the norms to the L1 norm in lanenet_discriminative_loss.py.
My training accuracy has reached 97%, but my validation accuracy is only 50% after 200k steps; in fact, it has been stuck at 50% since 100k steps. My global_config.py is unchanged. I wonder if you can give me some suggestions for this situation. Thank you!

MaybeShewill-CV (Owner) commented

@yxxxxxxxx You could first use the model weights to run testing on your test dataset and see whether the model performs as well as you expect :)

cs-heibao commented

@david-bernstein @yxxxxxxxx I've also met this problem and changed
tf.norm(mu_diff_bool, axis=1) to tf.norm(mu_diff_bool, ord=1, axis=1), but I still got nan. How did you do it?
Thanks!

david-bernstein (Author) commented

@yxxxxxxxx There are three calls to tf.norm in lanenet_discriminative_loss.py. I changed all of those to L1.

cs-heibao commented

@david-bernstein
Yes, I've changed all of those, as follows:

# compute the loss (var) term of the formula
distance = tf.norm(tf.subtract(mu_expand, reshaped_pred), ord=1, axis=1)
.........

mu_norm = tf.norm(mu_diff_bool, ord=1, axis=1)
.........

# regularization term of the loss
l_reg = tf.reduce_mean(tf.norm(mu, ord=1, axis=1))

Are these changes correct?

yxxxxxxxx commented

@JunJieAI Yeah, I changed all the L2 norms to L1. It indeed worked.

github-luffy commented

I use tensorflow_gpu 1.10.0, and I changed all the L2 norms to L1, but I still got a nan loss. I don't know why.

Berrlinn commented

> tf.norm(tf.subtract(mu_expand, reshaped_pred), ord=1, axis=1)

> @JunJieAI Yeah, I changed all the L2 norms to L1. It indeed worked.

Could you please explain what changes you made?
