Tensorflow 2.4 takes 3 seconds per epoch during training versus 1 second with TensorFlow 2.3 #45676
Comments
Please share a Colab link or simple standalone code to reproduce the issue in our environment. It helps us localize the issue faster. Thanks! |
Here is a Google Colab with the code that exhibits this issue: https://colab.research.google.com/drive/1E_IZ3zmPrQHbuiYl47gvL591hHfS-VkN#scrollTo=L5nAouOwIcnE Please Note: This needs to run on a V100 GPU to recreate this issue. With TF 2.3 each epoch takes about 1 second versus 3 seconds with TF 2.4 |
Going from 1 second per epoch to 3 seconds is a huge deal, especially since my model is trained on AWS, which means my training costs have effectively tripled... |
Please grant me access to the Colab link. Thanks! |
Just did! |
@nectario |
Here you go: Unzip the file to get all supporting files. |
Hi @nectario, does this reproduce on CPU also, or just GPU? |
Not sure if this is relevant to CPU, as training there takes 10 times longer, and I never measured CPU times for 2.3.0 vs 2.4.0. I am not sure about the above error. Are you using the correct CUDA/cuDNN versions? |
With GPU is 1 second vs 3 seconds. |
It used to be 1 second. That is still 50% more. Can you run it with 2.3? |
The key thing with this issue is the timings between 2.3 and 2.4. If both timings are equal, there is no issue. If 2.3 is faster, that is the issue that needs to be looked into. |
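The per-epoch timings being compared throughout this thread can be recorded explicitly with a small callback. The sketch below is hypothetical (the `EpochTimer` name and hook design are not from this thread); with TensorFlow installed, the same class would subclass `tf.keras.callbacks.Callback` and be passed to `model.fit(callbacks=[...])`, and Keras would invoke the hooks automatically.

```python
import time

# Minimal sketch of a per-epoch wall-clock timer (hypothetical helper).
# With TensorFlow available, subclass tf.keras.callbacks.Callback and pass
# an instance to model.fit(callbacks=[timer]); Keras calls these hooks.
class EpochTimer:
    def __init__(self):
        self.epoch_times = []  # seconds per epoch, in order

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)

# Simulate two epochs to show the hook usage; in real training Keras
# calls on_epoch_begin/on_epoch_end around each epoch automatically.
timer = EpochTimer()
for epoch in range(2):
    timer.on_epoch_begin(epoch)
    time.sleep(0.01)  # stand-in for the actual training work
    timer.on_epoch_end(epoch)
```

Running the same timer under both TF versions would give an apples-to-apples epoch-time comparison independent of the progress-bar output, which (as noted below) formats differently across versions.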
Yes, I am trying to run it with 2.3 but I'm getting the error message I shared earlier. I need to debug that first. |
CUDA and CUDNN versions for 2.3 are different. |
Btw, it has been very difficult to train this model to completion. I was getting issues similar to the ones you are getting, which I posted in a different thread a few months back. |
Adding this code at the start dropped my per-epoch time to 1 second! |
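The snippet referred to above is not quoted in the thread. Given the later observation that TF 2.4 no longer enables XLA by default, one plausible candidate is globally re-enabling XLA JIT compilation; this is an assumption about what the code might have been, not the poster's confirmed snippet.

```python
import tensorflow as tf

# Assumption: re-enable XLA JIT compilation globally. TF 2.4 stopped
# creating XLA devices by default, which the startup log also hints at
# ("Not creating XLA devices, tf_xla_enable_xla_devices not set").
tf.config.optimizer.set_jit(True)
```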
@nectario I ran the colab code locally with both TF 2.3.1 and TF 2.4 and, interestingly enough, didn't observe that much of a difference.
Original (batch_size=600 to fit my GPU): tf2.3.1
Mixed precision (batch_size=620): tf2.3.1
It looks like one thing that differs between TF 2.4 and TF 2.3.1 is XLA no longer being enabled by default, although I'm not sure how much of its automated magic was enabled in 2.3.1. This can also be seen from the TF "startup" log. Windows and XLA in TF 2.4 don't seem to play along that well at the moment (…).
My TF 2.3.1 setup: Win 10, graphics driver 460.89, CUDA 10.1, cudnn-7.6.5.32.
Possibly not relevant, but mentioning just to be sure: I do have Hardware-accelerated GPU scheduling enabled in Windows. |
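The mixed-precision variant mentioned above can be set up with Keras's mixed-precision API. The model below is a placeholder: the thread only reveals an LSTM with output size 152, so the input shape and the Dense head are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision globally. This is the TF 2.4 API; TF 2.3 used the
# tf.keras.mixed_precision.experimental namespace instead.
mixed_precision.set_global_policy("mixed_float16")

# Hypothetical model: only "LSTM output at 152" is known from the thread;
# the 32-feature input and 10-class softmax head are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(152, input_shape=(None, 32)),
    # Keep the final layer in float32 for numeric stability of the loss.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
```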
Nice, thank you @ahtik. This is interesting. I just ran a 1500-epoch training and it finished fine with no crashing on TF 2.4 (this is in reference to the crashing issue). However, each epoch still takes 3 seconds for me, even with all your optimizations (mixed precision and LSTM output at 152). |
@ahtik Now it seems with or without your optimization suggestions, it takes 3 seconds per epoch. |
Is this 100% the same code as in the colab, other than the renamed losses? The output looks a bit different; for example, there is no step-time info. My 150-epoch run finished in ~4 minutes (RTX 2070 Super). |
@ahtik The code in colab is different than the other code. This one is a bit more elaborate but same core architecture. |
@ahtik I renamed the outputs to more generic names. |
@ahtik With this data, I get 3 seconds per epoch. |
I'm a bit curious why your log output does not show step timing like mine:
TF 2.4, LSTM 152, Mixed Precision: ~4.25 minutes (0:04:14)
TF 2.4, LSTM 152, Regular ("unmixed"): ~6.75 minutes (0:06:46) |
@ahtik I have never seen the ms portion when I train; I don't remember ever seeing it. I just see the seconds portion... |
TF 2.3.1, LSTM 152, Mixed Precision: ~4.5 minutes (0:04:29)
TF 2.3.1, LSTM 152, Regular ("unmixed"): ~6 minutes
I wish I knew what else to look for to figure out the difference in setup. |
@ahtik You have an RTX 3070 and I have a Titan V. It may be because the RTX has big differences in the architecture compared to the older Titan V. |
Mine is an RTX 2070 Super, a bit older than the 3070; compute capability 7.5. I do have a GTX 1080 with Ubuntu somewhere on the cloud...
|
Is it possible that you were using mixed precision in 2.3 but not in 2.4? CC @reedwm |
@sanjoy Actually I take it back. At the moment with that enabled and not, I get 3 seconds per epoch. There was a discrepancy in the amount of data earlier when I observed this. I used the same data when I measured this now which is this: |
I just downgraded to TF 2.3.1 with CUDA 10.1/cuDNN 7.6. Training on the same data takes 2 seconds per epoch versus 3 seconds per epoch on TF 2.4. So: TF 2.4.0: 3 seconds; TF 2.3.1: 2 seconds. Both setups were identical; I simply downgraded the TF and CUDA versions. |
@nectario, have you tried collecting TensorBoard profiles for 2.4 and 2.3 to see if something jumps out? |
I can try doing this. |
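Collecting the suggested TensorBoard profiles can be done with the TensorBoard Keras callback's `profile_batch` option; the log directory name and batch range below are arbitrary assumptions, and the `fit` call is left commented since the model and data are specific to the thread.

```python
import tensorflow as tf

# Sketch: profile batches 2 through 4 of a run, so the per-op breakdown can
# be compared across TF 2.3 and TF 2.4 in TensorBoard's Profile tab.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/tf24_run",      # use a distinct dir per TF version
    profile_batch=(2, 4),         # (start_batch, stop_batch) to profile
)
# model.fit(x, y, epochs=..., callbacks=[tb_callback])
# Then inspect with: tensorboard --logdir logs
```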
Any luck with this issue? Will it be addressed in one of the upcoming releases? |
Hello, this is also a relevant issue for me. I have a personal RTX 2070 Super and each epoch takes around 5 s. These are the logs:
2021-04-01 20:03:46.712589: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll |
Why is it slower?
When I train, I get this standard log (in case it's helpful):
2020-12-14 18:34:49.552097: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-14 18:34:49.572089: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-14 18:34:49.671571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:83:00.0 name: TITAN V computeCapability: 7.0 coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 607.97GiB/s
2020-12-14 18:34:49.671911: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-14 18:34:50.350471: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-14 18:34:50.350673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-14 18:34:50.456348: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-14 18:34:50.513180: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-14 18:34:50.881235: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 18:34:51.192188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-14 18:34:51.216230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-14 18:34:51.216525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-14 18:34:51.627424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:83:00.0 name: TITAN V computeCapability: 7.0 coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 607.97GiB/s
2020-12-14 18:34:51.627803: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-14 18:34:51.627987: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-14 18:34:51.628149: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-14 18:34:51.628318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-14 18:34:51.628495: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-14 18:34:51.628657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 18:34:51.628843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-14 18:34:51.629021: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-14 18:34:51.629239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-14 18:34:52.792115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-14 18:34:52.792321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2020-12-14 18:34:52.792441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2020-12-14 18:34:52.792890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10243 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:83:00.0, compute capability: 7.0)
2020-12-14 18:34:52.830535: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Please make sure that this is an issue related to the performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)" (TF 1.x)
python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)" (TF 2.x)
Describe the current behavior
Describe the expected behavior
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate
the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.