Suggestion for cost=nan issue while training #124
@david-bernstein Could you please detail your experiment environment, such as the TF version? :)
Sure, the TF version varies, but it is usually 1.10.0 or 1.11.0. The code is running on Ubuntu 16.04, and I've run it on machines with 1070 and 1080 Ti GPUs. This issue with the L2 norm gradient being unstable is a known TensorFlow problem, see tensorflow/tensorflow#12071. I tried to implement the fix described in that issue but am not a TF expert and haven't gotten it to work yet.
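For reference, the workaround discussed in that TensorFlow issue amounts to computing the L2 norm by hand with a small epsilon inside the square root, so the gradient stays finite when the input is all zeros. A minimal sketch (TF 1.x style; the function name is illustrative, not code from this repo):

```python
import tensorflow as tf

def safe_l2_norm(x, axis=None, epsilon=1e-12, keepdims=False):
    # tf.norm's gradient is x / ||x||_2, which becomes 0/0 = NaN on an
    # all-zero tensor. Adding epsilon under the sqrt keeps it finite.
    return tf.sqrt(tf.reduce_sum(tf.square(x), axis=axis, keepdims=keepdims) + epsilon)
```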
@david-bernstein Thanks for kindly sharing this with us. I will test it later :)
@MaybeShewill-CV Have you tried this solution? How are the results? Thanks very much!
I also met this problem, and I've changed all the normalizations to the L1 norm in lanenet_discriminative_loss.py.
@yxxxxxxxx You may first use the model weights to run testing on your test dataset to see if the model performs as well as you wish :)
@david-bernstein @yxxxxxxxx I've also met this problem. Which of the tf.norm calls did you change?
@yxxxxxxxx There are three calls to tf.norm in lanenet_discriminative_loss.py. I changed all of those to L1.
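In case it helps others, the change amounts to passing ord=1 to those tf.norm calls. A rough before/after sketch (the variable names here are illustrative, not copied from lanenet_discriminative_loss.py):

```python
# Before: L2 norm; its gradient is x / ||x||_2, which is NaN when the
# embedding difference is exactly zero.
dist = tf.norm(mu_expand - x, axis=1)

# After: L1 norm; its (sub)gradient is sign(x), which stays finite.
dist = tf.norm(mu_expand - x, ord=1, axis=1)
```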
@david-bernstein Are these operations right?
@JunJieAI Yes, I changed all the L2 normalizations to L1. It indeed worked.
I use tensorflow_gpu 1.10.0, and I changed all the L2 normalizations to L1, but I still got a nan loss. I don't know why.
Could you please explain what changes you made?
I agree with a previous user that one source of the cost=nan error is the instability of the gradient of the L2 norm when operating on small or zero tensors. In my experience, this error appears randomly during training, but only every 30k or 40k steps. To prevent it, I changed the norm in the discriminative loss to the L1 norm, which does not have this stability issue. I have not gotten a nan during training since (several million steps over a number of training runs).
This of course changes the loss, but training seems to be just as effective as with the L2 norm.
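For anyone who wants to verify the instability directly, a small TF 1.x snippet (not from the repo) shows that the gradient of the L2 norm at an all-zero tensor is NaN, while the L1 norm's gradient stays finite:

```python
import tensorflow as tf

x = tf.constant([0.0, 0.0, 0.0])
l2 = tf.norm(x)          # gradient is x / ||x||_2 -> 0/0 at x = 0
l1 = tf.norm(x, ord=1)   # gradient is sign(x), which TF treats as 0 at x = 0

grad_l2 = tf.gradients(l2, x)[0]
grad_l1 = tf.gradients(l1, x)[0]

with tf.Session() as sess:
    print(sess.run(grad_l2))  # [nan nan nan]
    print(sess.run(grad_l1))  # [0. 0. 0.]
```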