Suggestion for cost=nan issue while training #124

Closed · david-bernstein opened this issue Dec 12, 2018 · 12 comments
Labels: help wanted (Extra attention is needed)

Comments

david-bernstein commented Dec 12, 2018

I agree with a previous user that one source of the cost=nan error is the instability of the gradient of the L2 norm when it operates on small or zero tensors. In my experience the error appears randomly during training, but only about every 30k–40k steps. To prevent it, I changed the norm in the discriminative loss to the L1 norm, which does not have this stability issue. I have not gotten a nan during training since (several million steps over a number of training runs).

This of course changes the loss, but training seems to be just as effective as with the L2 norm.
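
Here is a minimal sketch of what I mean (assuming TF 1.x graph mode; the snippet is illustrative and not code from this repo). At an all-zero vector the gradient of the default Euclidean tf.norm comes back as nan, while ord=1 stays finite:

import tensorflow as tf

x = tf.constant([0.0, 0.0, 0.0])

l2 = tf.norm(x)         # Euclidean norm: sqrt(sum(x^2))
l1 = tf.norm(x, ord=1)  # L1 norm: sum(|x|)

# d(sqrt(sum(x^2)))/dx = x / ||x||, which is 0/0 -> nan at x = 0
grad_l2 = tf.gradients(l2, x)[0]
# d(sum(|x|))/dx uses sign(x); TF defines the gradient of abs at 0 as 0
grad_l1 = tf.gradients(l1, x)[0]

with tf.Session() as sess:
    print(sess.run(grad_l2))  # [nan nan nan]
    print(sess.run(grad_l1))  # [0. 0. 0.]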

MaybeShewill-CV (Owner) commented

@david-bernstein Could you please give details of your experiment environment, such as the TF version, etc.? :)

MaybeShewill-CV added the help wanted (Extra attention is needed) label on Dec 13, 2018
david-bernstein (Author) commented

Sure, the TF version varies, but it is usually 1.10.0 or 1.11.0. The code is running on Ubuntu 16.04, and I've run it on machines with 1070 and 1080 Ti GPUs.

This issue with the L2 norm gradient being unstable is a known TensorFlow problem; see tensorflow/tensorflow#12071.

I tried to implement the fix described in that issue, but I am not a TF expert and haven't gotten it to work yet.
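
As I understand it, the workaround discussed there is roughly an epsilon-stabilized norm; here is a rough sketch of the idea (the function name and the epsilon value are just illustrative, and this is not the exact code from that issue):

import tensorflow as tf

def safe_l2_norm(x, axis=None, epsilon=1e-12):
    # A small epsilon inside the sqrt keeps the gradient finite when x is
    # all zeros, avoiding the 0/0 -> nan case hit by the default tf.norm.
    return tf.sqrt(tf.reduce_sum(tf.square(x), axis=axis) + epsilon)

# e.g. instead of tf.norm(mu_diff_bool, axis=1):
# mu_norm = safe_l2_norm(mu_diff_bool, axis=1)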

MaybeShewill-CV (Owner) commented

@david-bernstein Thanks for kindly sharing this with us. I will test it later :)

fayechou commented


@MaybeShewill-CV Have you tried this solution? How are the results? Thanks very much!

yxxxxxxxx commented

I also met this problem, and I changed all the norms to the L1 norm in lanenet_discriminative_loss.py.
My training accuracy has reached 97%, but my validation accuracy is only 50% after 200k steps; in fact, it has been stuck at 50% since 100k steps. My global_config.py is unchanged. I wonder if you can give me some suggestions for this situation. Thank you!

MaybeShewill-CV (Owner) commented

@yxxxxxxxx You could first use the model weights to run testing on your test dataset and see whether the model performs as well as you expect :)

cs-heibao commented

@david-bernstein @yxxxxxxxx I've also met this problem and changed
tf.norm(mu_diff_bool, axis=1) to tf.norm(mu_diff_bool, ord=1, axis=1), but I still got nan. How did you do it?
Thanks!

david-bernstein (Author) commented

@yxxxxxxxx There are three calls to tf.norm in lanenet_discriminative_loss.py. I changed all of those to L1.

cs-heibao commented

@david-bernstein
Yes, I've changed all of those, as follows:

# compute the loss (var) term of the formula
distance = tf.norm(tf.subtract(mu_expand, reshaped_pred), ord=1, axis=1)
.........

mu_norm = tf.norm(mu_diff_bool, ord=1, axis=1)
.........

# regularization term of the loss
l_reg = tf.reduce_mean(tf.norm(mu, ord=1, axis=1))

Are these changes correct?

yxxxxxxxx commented

@JunJieAI Yeah, I changed all the L2 norms to L1. It indeed worked.

github-luffy commented

I use tensorflow_gpu 1.10.0, and I changed all the L2 norms to L1, but I still got a nan loss. I don't know why.

Berrlinn commented

> tf.norm(tf.subtract(mu_expand, reshaped_pred), ord=1, axis=1)

> @JunJieAI Yeah, I changed all the L2 norms to L1. It indeed worked.

Could you please explain what changes you made?
