Training produces Out Of Memory error with TF 2.* but works with TF 1.14 #39574
Comments
Can you compare behavior on 1.15, 2.0, and 2.1 too, please?
Thank you for your response. I'll compare this on 1.15, 2.0, and 2.1 as well. I might just not be able to do it until next week. Until then:
I am also facing an Out Of Memory issue. I am using TensorFlow 2.0 and have enabled the allow_growth option.
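For reference, a minimal sketch of how memory growth (the TF 2.x counterpart of the TF 1.x `allow_growth` config option) is typically enabled; whether this matches the commenter's exact setup is an assumption:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of
# reserving the full device memory up front (TF 2.x equivalent
# of the TF 1.x allow_growth option). Must run before any op
# touches the GPU.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```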
I have tried this in Colab with TF-GPU versions 1.15, 2.0, 2.1, and 2.2 and was able to reproduce the issue. Please find the gist here. Thanks!
@ravikyram thank you for reproducing the problem! If you need me to provide any further information, please don't hesitate to comment.
Just tested the same procedure (with TF 1.15 and TF 2.2) on a larger AWS instance (p3.8xlarge with 4 GPUs). @ravikyram Maybe this helps in assessing the scope of the problem.
@Philipduerholt Did you update your CUDA version to 10.1 when using TF 2.2?
@ymodak No, the CUDA version is 10.0.130.
The example code crashes on Google Colab with TF versions 1.14 and 2.2, on both CPU and GPU runtimes.
Could reproduce the issue with …
Keras's numpy support involves data copies and is not designed for performant processing of large numpy datasets. I suggest either using tf.data to construct your inputs directly, or give …
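A minimal sketch of the tf.data suggestion, assuming `X` and `y` are the numpy arrays described in the report below and `model` is the compiled Keras model; the buffer and batch sizes are assumptions:

```python
import tensorflow as tf

# Wrap the numpy arrays in a tf.data pipeline so Keras streams
# batches to the GPU instead of copying the full arrays at once.
dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .shuffle(buffer_size=1024)              # buffer size is an assumption
    .batch(32)                              # batch size is an assumption
    .prefetch(tf.data.experimental.AUTOTUNE)
)

model.fit(dataset, epochs=10)
```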
Feel free to reopen if the above suggestion does not work for you. |
System information
Describe the current behavior
Training a simple model on a data set:
X shape: (35000, 200, 311)
y shape: (35000, 200, 19)
Right before the start of the first epoch, when using TF 2.2.0, training stops with an error:
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
(abbreviated for readability, and because this error seems to point indirectly to an Out Of Memory issue)
(also because the code example below reproduces the error message)
Interestingly, the exact same setup, model, and training set trains without issues on TF 1.14.
NOTES:
The allow_growth option didn't make a difference.
Describe the expected behavior
Training the above model runs as expected and does not error out.
Standalone code to reproduce the issue
The Colab link can be executed, but unless it is run on the specified hardware, the results are not meaningful. The best way to reproduce is to run the following code on the described setup with TF 1.14 and TF 2.2.0.
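The original code block did not survive extraction. As a placeholder, a minimal sketch consistent with the description above (random data of the stated shapes fed to `model.fit` as numpy arrays); the layer sizes and architecture are assumptions, not the reporter's original code:

```python
import numpy as np
import tensorflow as tf

# Random data matching the shapes from the report.
# As float32 this is roughly 8.7 GB for X alone.
rng = np.random.default_rng(0)
X = rng.random(size=(35000, 200, 311), dtype=np.float32)
y = rng.random(size=(35000, 200, 19), dtype=np.float32)

# A simple per-timestep model; the architecture is an assumption.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(200, 311)),
    tf.keras.layers.Dense(19, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# On TF 2.2, this fails before the first epoch with the GatherV2
# "Dst tensor is not initialized" error; on TF 1.14 it trains.
model.fit(X, y, batch_size=32, epochs=2)
```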
Other info / logs
Full traceback of error: