Tensorflow 2.4 takes 3 seconds per epoch during training versus 1 second with TensorFlow 2.3 #45676
Comments
Please share a Colab link or simple standalone code to reproduce the issue in our environment. It helps us localize the issue faster. Thanks! |
Here is a Google Colab with the code that exhibits this issue: https://colab.research.google.com/drive/1E_IZ3zmPrQHbuiYl47gvL591hHfS-VkN#scrollTo=L5nAouOwIcnE Please Note: This needs to run on a V100 GPU to recreate this issue. With TF 2.3 each epoch takes about 1 second versus 3 seconds with TF 2.4 |
Going from 1 second per epoch to 3 seconds is a huge deal, especially since my model is trained on AWS, which means my training costs have effectively tripled... |
Please grant me access to the Colab link. Thanks! |
Just did! |
@nectario |
Here you go: Unzip the file to get all supporting files. |
Hi @nectario, does this reproduce on CPU also, or just GPU? |
Not sure if this is relevant to CPU, as training there takes 10 times longer, and I never measured CPU times for 2.3.0 vs 2.4.0. I am not sure about the above error. Are you using the correct CUDA/cuDNN versions? |
With GPU is 1 second vs 3 seconds. |
It used to be 1 second. That is still 50% more. Can you run it with 2.3? |
The key thing with this issue is the timings between 2.3 and 2.4. If both timings are equal, there is no issue. If 2.3 is faster, that is the issue that needs to be looked into. |
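The per-epoch timings being compared throughout this thread can be recorded explicitly with a small callback. The sketch below is hypothetical (the `EpochTimer` name and hook design are not from this thread); with TensorFlow installed, the same class would subclass `tf.keras.callbacks.Callback` and be passed to `model.fit(callbacks=[...])`, and Keras would invoke the hooks automatically.

```python
import time

# Minimal sketch of a per-epoch wall-clock timer (hypothetical helper).
# With TensorFlow available, subclass tf.keras.callbacks.Callback and pass
# an instance to model.fit(callbacks=[timer]); Keras calls these hooks.
class EpochTimer:
    def __init__(self):
        self.epoch_times = []  # seconds per epoch, in order

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)

# Simulate two epochs to show the hook usage; in real training Keras
# calls on_epoch_begin/on_epoch_end around each epoch automatically.
timer = EpochTimer()
for epoch in range(2):
    timer.on_epoch_begin(epoch)
    time.sleep(0.01)  # stand-in for the actual training work
    timer.on_epoch_end(epoch)
```

Running the same timer under both TF versions would give an apples-to-apples epoch-time comparison independent of the progress-bar output, which (as noted below) formats differently across versions.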
Yes, I am trying to run it with 2.3 but I'm getting the error message I shared earlier. I need to debug that first. |
CUDA and CUDNN versions for 2.3 are different. |
Btw, it has been very difficult to train this model to completion. I was getting issues similar to the ones you are getting, which I posted in a different thread a few months back. |
Adding this code at the start dropped my per-epoch time to 1 second! |
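The snippet referred to above is not quoted in the thread. Given the later observation that TF 2.4 no longer enables XLA by default, one plausible candidate is globally re-enabling XLA JIT compilation; this is an assumption about what the code might have been, not the poster's confirmed snippet.

```python
import tensorflow as tf

# Assumption: re-enable XLA JIT compilation globally. TF 2.4 stopped
# creating XLA devices by default, which the startup log also hints at
# ("Not creating XLA devices, tf_xla_enable_xla_devices not set").
tf.config.optimizer.set_jit(True)
```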
@nectario I ran the colab code locally with both TF 2.3.1 and TF 2.4 and, interestingly enough, didn't observe that much of a difference.
Original (batch_size=600 to fit my GPU): tf2.3.1
Mixed precision (batch_size=620): tf2.3.1
It looks like one thing that differs between TF 2.4 and TF 2.3.1 is XLA no longer being enabled by default, although I'm not sure how much of its automated magic was enabled in 2.3.1. This can also be seen from the TF "startup" log. Windows and XLA in TF 2.4 don't seem to play along that well at the moment (…).
My TF 2.3.1 setup: Win 10, graphics driver 460.89, CUDA 10.1, cudnn-7.6.5.32.
Possibly not relevant, but mentioning just to be sure: I do have Hardware-accelerated GPU scheduling enabled in Windows. |
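The mixed-precision variant mentioned above can be set up with Keras's mixed-precision API. The model below is a placeholder: the thread only reveals an LSTM with output size 152, so the input shape and the Dense head are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision globally. This is the TF 2.4 API; TF 2.3 used the
# tf.keras.mixed_precision.experimental namespace instead.
mixed_precision.set_global_policy("mixed_float16")

# Hypothetical model: only "LSTM output at 152" is known from the thread;
# the 32-feature input and 10-class softmax head are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(152, input_shape=(None, 32)),
    # Keep the final layer in float32 for numeric stability of the loss.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
```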
Nice, thank you @ahtik. This is interesting. I just ran a 1500-epoch training and it finished fine with no crashing on TF 2.4 (this is in reference to the crashing issue). However, each epoch still takes 3 seconds for me, even with all your optimizations (mixed precision and LSTM output at 152). |
@ahtik Now it seems with or without your optimization suggestions, it takes 3 seconds per epoch. |
Is this 100% the same code as in the colab, other than the renamed losses? The output looks a bit different; for example, there is no step-time info. My 150-epoch run finished in ~4 minutes (RTX 2070 Super). |
@ahtik The code in colab is different than the other code. This one is a bit more elaborate but same core architecture. |
@ahtik I renamed the outputs to more generic names. |
@ahtik With this data, I get 3 seconds per epoch. |
I'm a bit curious why your log output does not show step timing like mine:
TF 2.4, LSTM 152, Mixed Precision: ~4.25 minutes (0:04:14)
TF 2.4, LSTM 152, Regular ("unmixed"): ~6.75 minutes (0:06:46) |
@ahtik I have never seen the ms portion when I train; I don't remember ever seeing it. I just see the seconds portion... |
TF 2.3.1, LSTM 152, Mixed Precision: ~4.5 minutes (0:04:29)
TF 2.3.1, LSTM 152, Regular ("unmixed"): ~6 minutes
I wish I knew what else to look for to figure out the difference in setup. |
@ahtik You have an RTX 3070 and I have a Titan V. It may be because the RTX has big differences in the architecture compared to the older Titan V. |
Mine is an RTX 2070 Super, a bit older than the 3070; compute capability 7.5. I do have a GTX 1080 with Ubuntu somewhere on the cloud...
|
Is it possible that you were using mixed precision in 2.3 but not in 2.4? CC @reedwm |
@sanjoy Actually I take it back. At the moment with that enabled and not, I get 3 seconds per epoch. There was a discrepancy in the amount of data earlier when I observed this. I used the same data when I measured this now which is this: |
I just downgraded to TF 2.3.1 with CUDA 10.1/cuDNN 7.6. Training on the same data takes 2 seconds per epoch versus 3 seconds per epoch on TF 2.4. So: TF 2.4.0: 3 seconds; TF 2.3.1: 2 seconds. Both setups were identical; I simply downgraded the TF and CUDA versions. |
@nectario, have you tried collecting TensorBoard profiles for 2.4 and 2.3 to see if something jumps out? |
I can try doing this. |
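Collecting the suggested TensorBoard profiles can be done with the TensorBoard Keras callback's `profile_batch` option; the log directory name and batch range below are arbitrary assumptions, and the `fit` call is left commented since the model and data are specific to the thread.

```python
import tensorflow as tf

# Sketch: profile batches 2 through 4 of a run, so the per-op breakdown can
# be compared across TF 2.3 and TF 2.4 in TensorBoard's Profile tab.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/tf24_run",      # use a distinct dir per TF version
    profile_batch=(2, 4),         # (start_batch, stop_batch) to profile
)
# model.fit(x, y, epochs=..., callbacks=[tb_callback])
# Then inspect with: tensorboard --logdir logs
```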
Any luck with this issue? Will it be addressed in one of the upcoming releases? |
Hello, this is also a relevant issue for me. I have a personal RTX 2070 Super and each epoch takes around 5 s. These are the logs:
2021-04-01 20:03:46.712589: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll |
Why is it slower?
When I train, I get this standard log (in case it's helpful):
2020-12-14 18:34:49.552097: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-14 18:34:49.572089: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-14 18:34:49.671571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:83:00.0 name: TITAN V computeCapability: 7.0 coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 607.97GiB/s
2020-12-14 18:34:49.671911: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-14 18:34:50.350471: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-14 18:34:50.350673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-14 18:34:50.456348: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-14 18:34:50.513180: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-14 18:34:50.881235: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 18:34:51.192188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-14 18:34:51.216230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-14 18:34:51.216525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-14 18:34:51.627424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:83:00.0 name: TITAN V computeCapability: 7.0 coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 607.97GiB/s
2020-12-14 18:34:51.627803: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-14 18:34:51.627987: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-14 18:34:51.628149: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-14 18:34:51.628318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-14 18:34:51.628495: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-14 18:34:51.628657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 18:34:51.628843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-14 18:34:51.629021: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-14 18:34:51.629239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-14 18:34:52.792115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-14 18:34:52.792321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2020-12-14 18:34:52.792441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2020-12-14 18:34:52.792890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10243 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:83:00.0, compute capability: 7.0)
2020-12-14 18:34:52.830535: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Please make sure that this is an issue related to the performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)" (TF 1.x)
python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)" (TF 2.x)
Describe the current behavior
Describe the expected behavior
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.