
TensorFlow 2.4 takes 3 seconds per epoch during training versus 1 second with TensorFlow 2.3 #45676

Open
nectario opened this issue Dec 14, 2020 · 36 comments
Labels: comp:gpu (GPU related issues) · regression issue (To spot regression issues in latest version) · TF 2.4 (for issues related to TF 2.4) · type:performance (Performance Issue)

Comments

@nectario

TensorFlow 2.4 takes 3 seconds per epoch during training versus 1 second with TensorFlow 2.3.

Why is it slower?

When I train, I get this standard log (in case it's helpful):

2020-12-14 18:34:49.552097: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-14 18:34:49.572089: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-14 18:34:49.671571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:83:00.0 name: TITAN V computeCapability: 7.0 coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 607.97GiB/s
2020-12-14 18:34:49.671911: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-14 18:34:50.350471: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-14 18:34:50.350673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-14 18:34:50.456348: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-14 18:34:50.513180: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-14 18:34:50.881235: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 18:34:51.192188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-14 18:34:51.216230: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-14 18:34:51.216525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-14 18:34:51.627424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:83:00.0 name: TITAN V computeCapability: 7.0 coreClock: 1.455GHz coreCount: 80 deviceMemorySize: 12.00GiB deviceMemoryBandwidth: 607.97GiB/s
2020-12-14 18:34:51.627803: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-14 18:34:51.627987: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-14 18:34:51.628149: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-14 18:34:51.628318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-14 18:34:51.628495: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-14 18:34:51.628657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-14 18:34:51.628843: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-14 18:34:51.629021: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-14 18:34:51.629239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-14 18:34:52.792115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-14 18:34:52.792321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2020-12-14 18:34:52.792441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2020-12-14 18:34:52.792890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10243 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:83:00.0, compute capability: 7.0)
2020-12-14 18:34:52.830535: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): 2.4.0
  • Python version: 3.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.0 / 8.02
  • GPU model and memory: NVIDIA Titan V 12 GB

You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
Training takes about 3 seconds per epoch on TensorFlow 2.4.

Describe the expected behavior
About 1 second per epoch, as with TensorFlow 2.3 on the same setup.

Standalone code to reproduce the issue
A Colab notebook and supporting data files are shared in the comments below.

Other info / logs
See the startup log above.

@nectario nectario added the type:performance Performance Issue label Dec 14, 2020
@ravikyram (Contributor)

@nectario

Please share a Colab link or simple standalone code that reproduces the issue in our environment. It helps us localize the issue faster. Thanks!

@ravikyram ravikyram added stat:awaiting response Status - Awaiting response from author TF 2.4 for issues related to TF 2.4 labels Dec 15, 2020
@nectario (Author)

@ravikyram

Here is a Google Colab with the code that exhibits this issue:

https://colab.research.google.com/drive/1E_IZ3zmPrQHbuiYl47gvL591hHfS-VkN#scrollTo=L5nAouOwIcnE

Please note: this needs to run on a V100 GPU to recreate the issue. With TF 2.3 each epoch takes about 1 second, versus 3 seconds with TF 2.4.

@nectario (Author)

Going from 1 second per epoch to 3 seconds is a huge deal, especially since my model is trained on AWS, which means my training costs roughly triple...

@ravikyram (Contributor)

@nectario

Please grant me access to the Colab link. Thanks!

@nectario (Author)

Just did!

@ravikyram (Contributor)

@nectario
Would it be possible to share the supporting data files (e.g., stock_data_df.pkl) to reproduce the issue? Thanks!

@nectario (Author)

Here you go:

supporting_files.zip

Unzip the file to get all supporting files.

@ravikyram ravikyram added comp:gpu GPU related issues and removed stat:awaiting response Status - Awaiting response from author labels Dec 16, 2020
@ravikyram ravikyram assigned ymodak and unassigned ravikyram Dec 16, 2020
@ymodak ymodak assigned nikitamaia and unassigned ymodak Dec 18, 2020
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 18, 2020
@nikitamaia (Member)

Hi @nectario, does this reproduce on CPU as well, or just GPU?
Additionally, I tried to reproduce this with 1 V100. I copied the notebook and the supporting files folder over to my VM, but I get an error in model.fit:

Epoch 1/100
2020-12-21 22:21:41.411410: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2020-12-21 22:21:41.742131: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-12-21 22:21:42.059676: F tensorflow/stream_executor/cuda/cuda_dnn.cc:1186] Check failed: cudnnSetRNNMatrixMathType(rnn_desc.get(), math_type) == CUDNN_STATUS_SUCCESS (3 vs. 0)
Aborted

@nectario (Author)

Not sure if this is relevant on CPU, since CPU training takes about 10 times longer and I never measured CPU times on 2.3.0 vs 2.4.0.

I am not sure about the above error. Are you using the correct CUDA/cuDNN versions?
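If it helps narrow down a version mismatch, the CUDA/cuDNN versions a TensorFlow binary was built against can be printed from Python. A minimal sketch, assuming tf.sysconfig.get_build_info() is available (it is in TF 2.3+); the exact keys can vary by build:

import tensorflow as tf

# Report the installed TF version and the CUDA/cuDNN versions the wheel was built against.
build = tf.sysconfig.get_build_info()
print(tf.__version__)
print(build.get("cuda_version"), build.get("cudnn_version"))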

@nectario (Author)

On GPU it is 1 second vs. 3 seconds.

@nikitamaia (Member) commented Dec 22, 2020

Hmm, not sure what is causing the Check failed error message.

At any rate, I ran the code with 2.4 and I'm seeing 2 seconds per epoch:

[Screenshot: "Screen Shot 2020-12-21 at 6 06 25 PM", training output showing ~2 s per epoch]

@nikitamaia nikitamaia added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Dec 22, 2020
@nectario (Author)

It used to be 1 second, so that is still twice as long. Can you run it with 2.3?

@nectario (Author)

The key thing with this issue is the relative timing between 2.3 and 2.4. If both timings are equal, there is no issue; if 2.3 is faster, that is what needs to be looked into.

@nikitamaia (Member)

Yes, I am trying to run it with 2.3, but I'm getting the error message I shared earlier. I need to debug that first.

@nectario (Author)

The CUDA and cuDNN versions required for 2.3 are different.

@nectario (Author)

By the way, it has been very difficult to train this model to completion. I was getting issues similar to the ones you are seeing, which I posted in a different thread a few months back.

@nectario (Author)

Adding this code at the start dropped my per-epoch time to 1 second!

from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
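For reference, TF 2.4 also exposes the same setting outside the experimental namespace. A minimal sketch of the equivalent non-experimental call (the experimental alias used above still works in 2.4):

from tensorflow.keras import mixed_precision

# Set the global dtype policy: compute in float16 while variables stay in float32.
mixed_precision.set_global_policy('mixed_float16')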

@ahtik (Contributor) commented Dec 23, 2020

@nectario I ran the Colab code locally with both TF 2.3.1 and TF 2.4 and, interestingly enough, didn't observe that much of a difference.

Original (batch_size=600 to fit my gpu)
tf2.4
2s epoch, 316ms/step

tf2.3.1
2s epoch, 295ms/step

Mixed Precision (batch_size=620)
tf2.4
1s epoch, ~190ms/step
1s epoch, ~170ms/step (increased LSTM unit from 150 to 152 to have x%8=0)

tf2.3.1
1s epoch, ~232ms/step (default XLA settings)
1s epoch, ~171ms/step (default XLA settings, increased LSTM unit from 150 to 152 to have x%8=0)
2s epoch, ~235ms/step (XLA disabled with tf.config.optimizer.set_jit(False))

It looks like one difference between TF 2.4 and TF 2.3.1 is that XLA is no longer enabled by default, although I'm not sure how much of its automated magic was actually enabled in 2.3.1. This can also be seen in the TF "startup" log. Windows and XLA in TF 2.4 don't seem to play along that well at the moment (TF_XLA_FLAGS=--tf_xla_enable_xla_devices --tf_xla_auto_jit=fusible causes tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0 and eventual hanging).

TF 2.3.1: Win 10, graphics driver 460.89, CUDA 10.1, cuDNN 7.6.5.32
TF 2.4: Win 10, graphics driver 460.89, CUDA 11.0, cuDNN 8.0.5.39

Possibly not relevant but mentioning just to be sure: I do have Hardware-accelerated GPU scheduling enabled in Windows.
https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/ (having it off was claimed to be causing some other issues with the newer cards at #45716)
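If the XLA default is the suspect, one way to compare the two versions on equal footing is to set JIT compilation explicitly before building the model, rather than relying on the per-version defaults. A minimal sketch, assuming the standard tf.config.optimizer.set_jit knob (the TF_XLA_FLAGS environment variable mentioned above is the other route):

import tensorflow as tf

# Force XLA auto-clustering on so TF 2.3.1 and TF 2.4 run with the same JIT setting.
tf.config.optimizer.set_jit(True)

# ...then build and fit the model as in the notebook.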

@nectario (Author)

Nice, thank you @ahtik, this is interesting.

I just ran a 1500-epoch training and it finished fine with no crashing on TF 2.4 (this is in reference to the crashing issue). However, each epoch still takes 3 seconds for me, even with all your optimizations (mixed precision and LSTM units at 152).

Epoch 1500/1500
7/7 - 3s - loss: 5.3210e-04 - daily_loss: 2.0926e-05 - weekly_loss: 4.1037e-05 - monthly_loss: 6.4028e-05 - three_month_loss: 1.6983e-04 - yearly_loss: 2.0743e-04 - direction_loss: 4.8010e-05 - daily_rmse: 0.0046 - daily_direction_accuracy: 0.8102 - weekly_rmse: 0.0064 - weekly_direction_accuracy: 0.8785 - monthly_rmse: 0.0080 - monthly_direction_accuracy: 0.9457 - three_month_rmse: 0.0130 - three_month_direction_accuracy: 0.9472 - yearly_rmse: 0.0144 - yearly_direction_accuracy: 0.9809 - direction_daily_accuracy: 1.0000 - val_loss: 3.2617 - val_daily_loss: 2.3174e-04 - val_weekly_loss: 7.3195e-04 - val_monthly_loss: 0.0015 - val_three_month_loss: 9.3365e-04 - val_yearly_loss: 0.0014 - val_direction_loss: 5.4297 - val_daily_rmse: 0.0152 - val_daily_direction_accuracy: 0.4769 - val_weekly_rmse: 0.0271 - val_weekly_direction_accuracy: 0.5692 - val_monthly_rmse: 0.0393 - val_monthly_direction_accuracy: 0.7077 - val_three_month_rmse: 0.0306 - val_three_month_direction_accuracy: 0.9077 - val_yearly_rmse: 0.0376 - val_yearly_direction_accuracy: 0.8615 - val_direction_daily_accuracy: 0.5385
{'loss': 0.0005321009666658938, 'daily_loss': 2.0926070646964945e-05, 'weekly_loss': 4.1037234041141346e-05, 'monthly_loss': 6.402765575330704e-05, 'three_month_loss': 0.0001698337437119335, 'yearly_loss': 0.0002074312506010756, 'direction_loss': 4.8009744205046445e-05, 'daily_rmse': 0.004576211795210838, 'daily_direction_accuracy': 0.8101887702941895, 'weekly_rmse': 0.006405814550817013, 'weekly_direction_accuracy': 0.8784587383270264, 'monthly_rmse': 0.008001040667295456, 'monthly_direction_accuracy': 0.9456943273544312, 'three_month_rmse': 0.013031532056629658, 'three_month_direction_accuracy': 0.9472459554672241, 'yearly_rmse': 0.014403595589101315, 'yearly_direction_accuracy': 0.9808636903762817, 'direction_daily_accuracy': 1.0, 'val_loss': 3.26171875, 'val_daily_loss': 0.00023174285888671875, 'val_weekly_loss': 0.0007319450378417969, 'val_monthly_loss': 0.0015411376953125, 'val_three_month_loss': 0.0009336471557617188, 'val_yearly_loss': 0.0014104843139648438, 'val_direction_loss': 5.4296875, 'val_daily_rmse': 0.015223219990730286, 'val_daily_direction_accuracy': 0.4769230782985687, 'val_weekly_rmse': 0.027050945907831192, 'val_weekly_direction_accuracy': 0.5692307949066162, 'val_monthly_rmse': 0.039255402982234955, 'val_monthly_direction_accuracy': 0.7076923251152039, 'val_three_month_rmse': 0.03055468387901783, 'val_three_month_direction_accuracy': 0.9076923131942749, 'val_yearly_rmse': 0.037555813789367676, 'val_yearly_direction_accuracy': 0.8615384697914124, 'val_direction_daily_accuracy': 0.5384615659713745}
Weights File: /tmp/validation_weights/DJI/val_daily_return_dir_accuracy-0.615_val_weekly_return_dir_accuracy-0.569_val_monthly_return_dir_accuracy-0.677_val_three_month_return_dir_accuracy-0.923_val_daily_rmse-0.014_val_direction_accuracy-0.585_epoch-1059.h5

@nectario (Author)

@ahtik Now it seems that, with or without your optimization suggestions, it takes 3 seconds per epoch.

@ahtik (Contributor) commented Dec 23, 2020

Is this 100% the same code as in the Colab, other than the renamed losses? The output looks a bit different, for example there is no step-time info.

My 150-epoch run finished in ~4 minutes (0:04:13) (mixed precision, LSTM with 152, TF 2.4, everything else default):

Epoch 150/150
7/7 [==============================] - 1s 175ms/step - loss: -3.8389 - dense_loss: 2.2464e-04 - dense_1_loss: 5.3154e-04 - dense_2_loss: 6.7071e-04 - dense_3_loss: 7.6445e-04 - dense_4_loss: 8.4391e-04 - dense_5_loss: -6.3981 - dense_rmse: 0.0150 - dense_direction_accuracy: 0.5012 - dense_1_rmse: 0.0231 - dense_1_direction_accuracy: 0.6162 - dense_2_rmse: 0.0259 - dense_2_direction_accuracy: 0.8292 - dense_3_rmse: 0.0276 - dense_3_direction_accuracy: 0.9069 - dense_4_rmse: 0.0290 - dense_4_direction_accuracy: 0.9085 - dense_5_daily_accuracy: 0.0000e+00

RTX 2070 Super

@nectario (Author) commented Dec 23, 2020

@ahtik The code in the Colab is different from the other code. This one is a bit more elaborate but has the same core architecture.

@nectario (Author)

@ahtik I renamed the outputs to more generic names.

@nectario (Author)

new_data.zip

@ahtik With this data, I get 3 seconds per epoch.

@ahtik (Contributor) commented Dec 23, 2020

With the new_data, batch_size=600 (to fit my GPU in all scenarios), 150 epochs, RTX 2070S.

I'm a bit curious why your log output does not show step timing like 177ms/step.

TF 2.4, LSTM 152, Mixed Precision, ~4.25 minutes (0:04:14)
Epoch 150/150
7/7 [==============================] - 1s 177ms/step - loss: -3.9754 - dense_loss: 1.9943e-04 - dense_1_loss: 4.2309e-04 - dense_2_loss: 5.6400e-04 - dense_3_loss: 6.8907e-04 - dense_4_loss: 6.8543e-04 - dense_5_loss: -6.6267 - dense_rmse: 0.0141 - dense_direction_accuracy: 0.5165 - dense_1_rmse: 0.0206 - dense_1_direction_accuracy: 0.6749 - dense_2_rmse: 0.0237 - dense_2_direction_accuracy: 0.8393 - dense_3_rmse: 0.0262 - dense_3_direction_accuracy: 0.9117 - dense_4_rmse: 0.0262 - dense_4_direction_accuracy: 0.9108 - dense_5_daily_accuracy: 0.0000e+00

TF 2.4, LSTM 152, Regular ("unmixed"), ~6.75 minutes (0:06:46)
Epoch 150/150
7/7 [==============================] - 2s 327ms/step - loss: -3.7065 - dense_loss: 2.2837e-04 - dense_1_loss: 5.4915e-04 - dense_2_loss: 6.1801e-04 - dense_3_loss: 0.0012 - dense_4_loss: 0.0011 - dense_5_loss: -6.1837 - dense_rmse: 0.0151 - dense_direction_accuracy: 0.4959 - dense_1_rmse: 0.0234 - dense_1_direction_accuracy: 0.6139 - dense_2_rmse: 0.0249 - dense_2_direction_accuracy: 0.8466 - dense_3_rmse: 0.0343 - dense_3_direction_accuracy: 0.8807 - dense_4_rmse: 0.0336 - dense_4_direction_accuracy: 0.8896 - dense_5_daily_accuracy: 0.0000e+00

@nectario (Author)

@ahtik I have never seen the ms portion when I train; I don't remember ever seeing it. I just see the seconds...
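The "ms/step" figure is only part of the progress-bar output: if model.fit is called with verbose=2, Keras prints a single summary line per epoch (as in the "7/7 - 3s - loss: ..." logs above) and omits it. A minimal sketch of the flag that likely explains the difference, with model, train_x and train_y standing in for the notebook's objects:

# verbose=1 (the default) shows the progress bar with per-step timing;
# verbose=2 prints one summary line per epoch without the ms/step figure.
model.fit(train_x, train_y, epochs=150, batch_size=600, verbose=1)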

@ahtik (Contributor) commented Dec 23, 2020

TF 2.3.1, LSTM 152, Mixed Precision, ~4.5 minutes (0:04:29)
Epoch 150/150
7/7 [==============================] - 1s 164ms/step - loss: -3.7234 - dense_loss: 2.0466e-04 - dense_1_loss: 5.1873e-04 - dense_2_loss: 5.6820e-04 - dense_3_loss: 8.1529e-04 - dense_4_loss: 7.8634e-04 - dense_5_loss: -6.2101 - dense_rmse: 0.0143 - dense_direction_accuracy: 0.4912 - dense_1_rmse: 0.0228 - dense_1_direction_accuracy: 0.6092 - dense_2_rmse: 0.0238 - dense_2_direction_accuracy: 0.8431 - dense_3_rmse: 0.0286 - dense_3_direction_accuracy: 0.9141 - dense_4_rmse: 0.0280 - dense_4_direction_accuracy: 0.9176 - dense_5_daily_accuracy: 0.0000e+00

TF 2.3.1, LSTM 152, Regular ("unmixed"), ~6 minutes
Epoch 150/150
7/7 [==============================] - 2s 303ms/step - loss: -3.7612 - dense_loss: 1.9700e-04 - dense_1_loss: 4.7340e-04 - dense_2_loss: 6.3550e-04 - dense_3_loss: 8.6339e-04 - dense_4_loss: 7.9784e-04 - dense_5_loss: -6.2736 - dense_rmse: 0.0140 - dense_direction_accuracy: 0.5128 - dense_1_rmse: 0.0218 - dense_1_direction_accuracy: 0.6382 - dense_2_rmse: 0.0252 - dense_2_direction_accuracy: 0.8350 - dense_3_rmse: 0.0294 - dense_3_direction_accuracy: 0.9080 - dense_4_rmse: 0.0282 - dense_4_direction_accuracy: 0.9141 - dense_5_daily_accuracy: 0.0000e+00

I wish I knew what else to look at to figure out the difference in our setups.

@nectario (Author)

@ahtik You have an RTX 2070 Super and I have a Titan V. It may be because the RTX architecture differs significantly from the older Titan V.

@ahtik (Contributor) commented Dec 23, 2020 via email

@sanjoy (Contributor) commented Dec 23, 2020

Adding this code at the start dropped my per-epoch time to 1 second!

from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

Is it possible that you were using mixed precision in 2.3 but not in 2.4?

CC @reedwm
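One quick way to confirm which dtype policy is actually active in each environment is to print the global policy before training. A minimal sketch using the same experimental module as the snippet above:

from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Prints "float32" if no mixed-precision policy was set, "mixed_float16" otherwise.
print(mixed_precision.global_policy().name)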

@nectario (Author)

@sanjoy Actually, I take it back. At the moment I get 3 seconds per epoch with mixed precision both enabled and disabled. There was a discrepancy in the amount of data earlier, when I observed the speedup. I used the same data when I measured it now, which is this:
new_data.zip

@nectario (Author) commented Dec 23, 2020

@sanjoy @ahtik

I just downgraded to TF 2.3.1 with CUDA 10.1 / cuDNN 7.6.

Training on the same data takes 2 seconds per epoch, versus 3 seconds per epoch on TF 2.4.

So:

TF 2.4.0: 3 seconds
TF 2.3.1: 2 seconds

Both setups were identical. I simply downgraded the TF and CUDA versions.

Epoch 198/1500
{'loss': 0.4114486277103424, 'daily_loss': 0.00019866823276970536, 'weekly_loss': 0.0008702529012225568, 'monthly_loss': 0.0005495575023815036, 'three_month_loss': 0.0010519209317862988, 'yearly_loss': 0.0016778710996732116, 'direction_loss': 0.6785006523132324, 'daily_rmse': 0.014094972051680088, 'daily_direction_accuracy': 0.49185416102409363, 'weekly_rmse': 0.029500048607587814, 'weekly_direction_accuracy': 0.4587535560131073, 'monthly_rmse': 0.023442642763257027, 'monthly_direction_accuracy': 0.839927613735199, 'three_month_rmse': 0.03243333101272583, 'three_month_direction_accuracy': 0.8919058442115784, 'yearly_rmse': 0.04096182435750961, 'yearly_direction_accuracy': 0.9420739412307739, 'direction_daily_accuracy': 0.5637444853782654, 'val_loss': 0.42898598313331604, 'val_daily_loss': 0.00019334981334395707, 'val_weekly_loss': 0.0019688063766807318, 'val_monthly_loss': 0.0023292433470487595, 'val_three_month_loss': 0.001890965853817761, 'val_yearly_loss': 0.0008342369110323489, 'val_direction_loss': 0.7029489278793335, 'val_daily_rmse': 0.013905028812587261, 'val_daily_direction_accuracy': 0.446153849363327, 'val_weekly_rmse': 0.04437123239040375, 'val_weekly_direction_accuracy': 0.3384615480899811, 'val_monthly_rmse': 0.04826223477721214, 'val_monthly_direction_accuracy': 0.6769230961799622, 'val_three_month_rmse': 0.043485235422849655, 'val_three_month_direction_accuracy': 0.892307698726654, 'val_yearly_rmse': 0.028883159160614014, 'val_yearly_direction_accuracy': 0.8153846263885498, 'val_direction_daily_accuracy': 0.5076923370361328}
7/7 - 2s - loss: 0.4114 - daily_loss: 1.9867e-04 - weekly_loss: 8.7025e-04 - monthly_loss: 5.4956e-04 - three_month_loss: 0.0011 - yearly_loss: 0.0017 - direction_loss: 0.6785 - daily_rmse: 0.0141 - daily_direction_accuracy: 0.4919 - weekly_rmse: 0.0295 - weekly_direction_accuracy: 0.4588 - monthly_rmse: 0.0234 - monthly_direction_accuracy: 0.8399 - three_month_rmse: 0.0324 - three_month_direction_accuracy: 0.8919 - yearly_rmse: 0.0410 - yearly_direction_accuracy: 0.9421 - direction_daily_accuracy: 0.5637 - val_loss: 0.4290 - val_daily_loss: 1.9335e-04 - val_weekly_loss: 0.0020 - val_monthly_loss: 0.0023 - val_three_month_loss: 0.0019 - val_yearly_loss: 8.3424e-04 - val_direction_loss: 0.7029 - val_daily_rmse: 0.0139 - val_daily_direction_accuracy: 0.4462 - val_weekly_rmse: 0.0444 - val_weekly_direction_accuracy: 0.3385 - val_monthly_rmse: 0.0483 - val_monthly_direction_accuracy: 0.6769 - val_three_month_rmse: 0.0435 - val_three_month_direction_accuracy: 0.8923 - val_yearly_rmse: 0.0289 - val_yearly_direction_accuracy: 0.8154 - val_direction_daily_accuracy: 0.5077
Epoch 199/1500
{'loss': 0.41261425614356995, 'daily_loss': 0.00021032587392255664, 'weekly_loss': 0.000582002685405314, 'monthly_loss': 0.0005726460367441177, 'three_month_loss': 0.0009110110113397241, 'yearly_loss': 0.0015689559513702989, 'direction_loss': 0.681282103061676, 'daily_rmse': 0.014502615667879581, 'daily_direction_accuracy': 0.5011637210845947, 'weekly_rmse': 0.024124732241034508, 'weekly_direction_accuracy': 0.5919317007064819, 'monthly_rmse': 0.023930024355649948, 'monthly_direction_accuracy': 0.844840943813324, 'three_month_rmse': 0.030182959511876106, 'three_month_direction_accuracy': 0.8952676653862, 'yearly_rmse': 0.03961004689335823, 'yearly_direction_accuracy': 0.9475045204162598, 'direction_daily_accuracy': 0.5549521446228027, 'val_loss': 0.412332147359848, 'val_daily_loss': 0.00014936340448912233, 'val_weekly_loss': 0.0008999488200061023, 'val_monthly_loss': 0.002337433397769928, 'val_three_month_loss': 0.0011293195420876145, 'val_yearly_loss': 0.0006307148723863065, 'val_direction_loss': 0.6786422729492188, 'val_daily_rmse': 0.01222143229097128, 'val_daily_direction_accuracy': 0.4615384638309479, 'val_weekly_rmse': 0.0299991462379694, 'val_weekly_direction_accuracy': 0.5846154093742371, 'val_monthly_rmse': 0.04834701120853424, 'val_monthly_direction_accuracy': 0.6153846383094788, 'val_three_month_rmse': 0.03360534831881523, 'val_three_month_direction_accuracy': 1.0, 'val_yearly_rmse': 0.025114037096500397, 'val_yearly_direction_accuracy': 0.9076923131942749, 'val_direction_daily_accuracy': 0.5692307949066162}
7/7 - 2s - loss: 0.4126 - daily_loss: 2.1033e-04 - weekly_loss: 5.8200e-04 - monthly_loss: 5.7265e-04 - three_month_loss: 9.1101e-04 - yearly_loss: 0.0016 - direction_loss: 0.6813 - daily_rmse: 0.0145 - daily_direction_accuracy: 0.5012 - weekly_rmse: 0.0241 - weekly_direction_accuracy: 0.5919 - monthly_rmse: 0.0239 - monthly_direction_accuracy: 0.8448 - three_month_rmse: 0.0302 - three_month_direction_accuracy: 0.8953 - yearly_rmse: 0.0396 - yearly_direction_accuracy: 0.9475 - direction_daily_accuracy: 0.5550 - val_loss: 0.4123 - val_daily_loss: 1.4936e-04 - val_weekly_loss: 8.9995e-04 - val_monthly_loss: 0.0023 - val_three_month_loss: 0.0011 - val_yearly_loss: 6.3071e-04 - val_direction_loss: 0.6786 - val_daily_rmse: 0.0122 - val_daily_direction_accuracy: 0.4615 - val_weekly_rmse: 0.0300 - val_weekly_direction_accuracy: 0.5846 - val_monthly_rmse: 0.0483 - val_monthly_direction_accuracy: 0.6154 - val_three_month_rmse: 0.0336 - val_three_month_direction_accuracy: 1.0000 - val_yearly_rmse: 0.0251 - val_yearly_direction_accuracy: 0.9077 - val_direction_daily_accuracy: 0.5692
Epoch 200/1500
{'loss': 0.417228639125824, 'daily_loss': 0.00025663431733846664, 'weekly_loss': 0.0010450570844113827, 'monthly_loss': 0.0007138350629247725, 'three_month_loss': 0.0010068160481750965, 'yearly_loss': 0.002297411672770977, 'direction_loss': 0.686514675617218, 'daily_rmse': 0.016019809991121292, 'daily_direction_accuracy': 0.49211275577545166, 'weekly_rmse': 0.03232734277844429, 'weekly_direction_accuracy': 0.5898629426956177, 'monthly_rmse': 0.026717692613601685, 'monthly_direction_accuracy': 0.7913110852241516, 'three_month_rmse': 0.03173036500811577, 'three_month_direction_accuracy': 0.8929402828216553, 'yearly_rmse': 0.04793132096529007, 'yearly_direction_accuracy': 0.9330230355262756, 'direction_daily_accuracy': 0.5358158946037292, 'val_loss': 0.4141582250595093, 'val_daily_loss': 0.00014011705934535712, 'val_weekly_loss': 0.0018782862462103367, 'val_monthly_loss': 0.0017431046580895782, 'val_three_month_loss': 0.0013259940315037966, 'val_yearly_loss': 0.0009762737900018692, 'val_direction_loss': 0.6801573634147644, 'val_daily_rmse': 0.011837105266749859, 'val_daily_direction_accuracy': 0.4769230782985687, 'val_weekly_rmse': 0.04333920031785965, 'val_weekly_direction_accuracy': 0.32307693362236023, 'val_monthly_rmse': 0.041750505566596985, 'val_monthly_direction_accuracy': 0.6769230961799622, 'val_three_month_rmse': 0.03641420230269432, 'val_three_month_direction_accuracy': 1.0, 'val_yearly_rmse': 0.031245380640029907, 'val_yearly_direction_accuracy': 0.8153846263885498, 'val_direction_daily_accuracy': 0.5538461804389954}
7/7 - 2s - loss: 0.4172 - daily_loss: 2.5663e-04 - weekly_loss: 0.0010 - monthly_loss: 7.1384e-04 - three_month_loss: 0.0010 - yearly_loss: 0.0023 - direction_loss: 0.6865 - daily_rmse: 0.0160 - daily_direction_accuracy: 0.4921 - weekly_rmse: 0.0323 - weekly_direction_accuracy: 0.5899 - monthly_rmse: 0.0267 - monthly_direction_accuracy: 0.7913 - three_month_rmse: 0.0317 - three_month_direction_accuracy: 0.8929 - yearly_rmse: 0.0479 - yearly_direction_accuracy: 0.9330 - direction_daily_accuracy: 0.5358 - val_loss: 0.4142 - val_daily_loss: 1.4012e-04 - val_weekly_loss: 0.0019 - val_monthly_loss: 0.0017 - val_three_month_loss: 0.0013 - val_yearly_loss: 9.7627e-04 - val_direction_loss: 0.6802 - val_daily_rmse: 0.0118 - val_daily_direction_accuracy: 0.4769 - val_weekly_rmse: 0.0433 - val_weekly_direction_accuracy: 0.3231 - val_monthly_rmse: 0.0418 - val_monthly_direction_accuracy: 0.6769 - val_three_month_rmse: 0.0364 - val_three_month_direction_accuracy: 1.0000 - val_yearly_rmse: 0.0312 - val_yearly_direction_accuracy: 0.8154 - val_direction_daily_accuracy: 0.5538
Epoch 201/1500
{'loss': 0.41835010051727295, 'daily_loss': 0.0002527496835682541, 'weekly_loss': 0.0012075683334842324, 'monthly_loss': 0.0007212499622255564, 'three_month_loss': 0.001039160997606814, 'yearly_loss': 0.004198853857815266, 'direction_loss': 0.6848840713500977, 'daily_rmse': 0.015898102894425392, 'daily_direction_accuracy': 0.5024566650390625, 'weekly_rmse': 0.034750085324048996, 'weekly_direction_accuracy': 0.4915955662727356, 'monthly_rmse': 0.02685609646141529, 'monthly_direction_accuracy': 0.826997697353363, 'three_month_rmse': 0.032236021012067795, 'three_month_direction_accuracy': 0.887251079082489, 'yearly_rmse': 0.06479856371879578, 'yearly_direction_accuracy': 0.8704422116279602, 'direction_daily_accuracy': 0.5492630004882812, 'val_loss': 0.416068434715271, 'val_daily_loss': 0.00016652456542942673, 'val_weekly_loss': 0.0010100970976054668, 'val_monthly_loss': 0.002100505866110325, 'val_three_month_loss': 0.0014713642885908484, 'val_yearly_loss': 0.0015848802868276834, 'val_direction_loss': 0.6828917264938354, 'val_daily_rmse': 0.012904440052807331, 'val_daily_direction_accuracy': 0.446153849363327, 'val_weekly_rmse': 0.03178202360868454, 'val_weekly_direction_accuracy': 0.4307692348957062, 'val_monthly_rmse': 0.04583127424120903, 'val_monthly_direction_accuracy': 0.6461538672447205, 'val_three_month_rmse': 0.03835836797952652, 'val_three_month_direction_accuracy': 0.9692307710647583, 'val_yearly_rmse': 0.039810553193092346, 'val_yearly_direction_accuracy': 0.9076923131942749, 'val_direction_daily_accuracy': 0.5538461804389954}
7/7 - 2s - loss: 0.4184 - daily_loss: 2.5275e-04 - weekly_loss: 0.0012 - monthly_loss: 7.2125e-04 - three_month_loss: 0.0010 - yearly_loss: 0.0042 - direction_loss: 0.6849 - daily_rmse: 0.0159 - daily_direction_accuracy: 0.5025 - weekly_rmse: 0.0348 - weekly_direction_accuracy: 0.4916 - monthly_rmse: 0.0269 - monthly_direction_accuracy: 0.8270 - three_month_rmse: 0.0322 - three_month_direction_accuracy: 0.8873 - yearly_rmse: 0.0648 - yearly_direction_accuracy: 0.8704 - direction_daily_accuracy: 0.5493 - val_loss: 0.4161 - val_daily_loss: 1.6652e-04 - val_weekly_loss: 0.0010 - val_monthly_loss: 0.0021 - val_three_month_loss: 0.0015 - val_yearly_loss: 0.0016 - val_direction_loss: 0.6829 - val_daily_rmse: 0.0129 - val_daily_direction_accuracy: 0.4462 - val_weekly_rmse: 0.0318 - val_weekly_direction_accuracy: 0.4308 - val_monthly_rmse: 0.0458 - val_monthly_direction_accuracy: 0.6462 - val_three_month_rmse: 0.0384 - val_three_month_direction_accuracy: 0.9692 - val_yearly_rmse: 0.0398 - val_yearly_direction_accuracy: 0.9077 - val_direction_daily_accuracy: 0.5538
Epoch 202/1500

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 25, 2020
@sanjoy (Contributor) commented Dec 29, 2020

@nectario, have you tried collecting TensorBoard profiles for 2.4 and 2.3 to see if something jumps out?

@nectario (Author)

@nectario, have you tried collecting TensorBoard profiles for 2.4 and 2.3 to see if something jumps out?

I can try doing this.
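A minimal sketch of how such a profile can be collected with the Keras TensorBoard callback; the log directory name here is arbitrary, and profile_batch picks which batches to trace:

import tensorflow as tf

# Trace batches 10-20 of training; the result appears under TensorBoard's Profile tab
# when TensorBoard is pointed at log_dir.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/profile_tf24", profile_batch=(10, 20))

# model.fit(..., callbacks=[tb_callback])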

@geetachavan1 geetachavan1 added the regression issue To spot regression issues in latest version label Jan 9, 2021
@nectario (Author)

Has there been any luck with this issue? Will it be addressed in any of the upcoming releases?

@raphyphy commented Apr 1, 2021

Hello, this issue is also relevant to me. I have a personal RTX 2070 Super and each epoch takes around 5 s.

These are the logs:

2021-04-01 20:03:46.712589: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2.4.1
2021-04-01 20:03:48.725664: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-01 20:03:48.727030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-01 20:03:48.759345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 Super computeCapability: 7.5
coreClock: 1.38GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-04-01 20:03:48.759725: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-01 20:03:48.765863: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-01 20:03:48.766005: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-01 20:03:48.768815: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-01 20:03:48.769891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-01 20:03:48.774016: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-04-01 20:03:48.776639: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-01 20:03:48.777397: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-01 20:03:48.777662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-01 20:03:48.778345: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-01 20:03:48.779352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 Super computeCapability: 7.5
coreClock: 1.38GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-04-01 20:03:48.779854: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-01 20:03:48.780161: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-01 20:03:48.780471: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-01 20:03:48.780708: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-01 20:03:48.780930: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-01 20:03:48.781206: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-04-01 20:03:48.781430: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-01 20:03:48.781575: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-01 20:03:48.781744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-01 20:03:49.268423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-01 20:03:49.268576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-04-01 20:03:49.268669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-04-01 20:03:49.268882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6611 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 Super, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-04-01 20:03:49.269599: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
...
1875/1875 [==============================] - 7s 3ms/step - loss: 0.5143 - accuracy: 0.8183
Epoch 2/100
1875/1875 [==============================] - 5s 3ms/step - loss: 0.2595 - accuracy: 0.9065
Epoch 3/100
....
