TF 2.3 training slowed down by 15% compared to 2.2 #41827
Comments
I was able to reproduce the issue. Here is the gist.
Thanks for the report @lgeiger. We will look more into this. The code sample uses MirroredStrategy, so I wanted to clarify: when you said this regression is noticed on a single GPU as well, is that with or without MirroredStrategy? If the latter, we would start investigating that case first (i.e. 1 GPU, no distribution, no mixed precision).
@guptapriya Thanks for looking into this.
Sorry about that, I missed it since I copied the example from a past issue I had with multi-GPU. I ran a few more benchmarks with the above example on a single-GPU machine:
Indeed, it looks like whether mirrored strategy is used or not has a large influence. I am not sure why there is a difference in execution speed for mixed precision with and without a strategy, although I think that might be a separate issue. One thing to note is that 2.3 logs the following deprecation warning when used with mirrored strategy, which wasn't present before, but that might be unrelated as well:
Note that this slowdown is more noticeable with mixed precision, as the kernel execution time is smaller and the increased idle time is therefore easier to spot.
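For reference, a minimal sketch of how the two configurations being compared here differ; make_model and dataset are placeholders, not code from the original gist:

```python
import tensorflow as tf

def run_benchmark(use_strategy, make_model, dataset, epochs=3):
    """Train the same model with or without MirroredStrategy."""
    if use_strategy:
        strategy = tf.distribute.MirroredStrategy()
        # Variables must be created and the model compiled inside the scope.
        with strategy.scope():
            model = make_model()
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy")
    else:
        model = make_model()
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy")
    model.fit(dataset, epochs=epochs)
```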
Thanks @lgeiger for the update, that helps a lot. I can confirm that we have verified the regression (looks like it happened sometime back in April).
@guptapriya Thank you very much for the fast response, I am glad that you were able to verify the regression.
April sounds like a long time for the regression to stay unnoticed. Am I doing something abnormal in the example, or am I missing a best practice here?
Actually, we are now no longer sure about the timing, because the benchmark I used to get the timing was itself changed in April. So we are looking into the timing of the regression again. It may have come up later.
Makes sense, thanks for the help.
I usually stay on the stable versions of TensorFlow for production workloads, so I first noticed it around 2.3 RC 0 or 1 when running our internal ImageNet sanity check before upgrading the TF version. But it took some time to find a reliably reproducible example that I could use to open the issue.
@lgeiger thanks for providing the TensorBoard profile. We have some observations in your profiles, which use fp16. You mentioned that the regression also happens on fp32 (and multi-GPU). Do you happen to have the fp32 profiles of TF 2.2 and 2.3? Thanks.
@zongweiz Sure, I reran the above code example (float32 with mirrored strategy) to generate these profiles:
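For anyone reproducing these profiles, a sketch of one way to capture them with the Keras TensorBoard callback; the log directory name and batch range are arbitrary choices, not taken from the original run:

```python
import tensorflow as tf

# Trace batches 10-20 so the per-step timeline shows up in TensorBoard's
# Profile tab. The comma-separated string form works in TF 2.2/2.3.
tb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/tf23-float32-mirrored",
    profile_batch="10,20",
)
# model.fit(dataset, epochs=1, callbacks=[tb])
```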
I hope the profiles are helpful for debugging. Do you have any news on the resolution of this issue?
@jaingaurav is looking into a potential fix, I believe.
Yes, we found some unintended host-to-device copies caused by a previous change, which I am trying to eliminate.
I also found that TensorFlow 2.3 mixed precision didn't speed up training the way TensorFlow 2.2.0 does. The only improvement I see in TensorFlow 2.3 is the lower overhead of multi-GPU training.
@jaingaurav Thanks for looking into it. Do you know if this fix will make it into the 2.4 release?
@lgeiger: Yes, this is currently planned for the 2.4 release. However, the fix is still being worked on.
@jaingaurav I just checked with the new release candidate and CUDA 11, but unfortunately this issue still exists and the increased idle time is clearly visible in the profiles. Below are the numbers for
cc: @rohan100jain, @zongweiz. Looks like the fix in 673b993 didn't quite work as well as we'd hoped.
Thanks for looking into it. I also updated the table with measurements using the new Keras mixed precision API, which makes the slowdown easier to spot.
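For context, the "new" API referenced here is the non-experimental Keras mixed precision path; a minimal sketch of enabling it, with the pre-2.4 experimental spelling shown for comparison:

```python
import tensorflow as tf

# TF 2.4+: stable Keras mixed precision API.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# TF 2.1-2.3 equivalent (experimental namespace):
# policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
# tf.keras.mixed_precision.experimental.set_policy(policy)
```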
@rohan100jain Has there been any progress on this? From briefly skimming the changes in RC1, it looks like it doesn't include a fix for this. This bug has been blocking me from upgrading to 2.3, so it would be great to get it resolved in 2.4. Please let me know if there is anything I could do from my side to help you debug this further.
I double-checked and the issue still exists in tf-nightly 2.5.0-dev20201110.
@rohan100jain @zongweiz @jaingaurav I did a bit more testing with drop_remainder=True. I am a bit confused why keeping the remainder results in such a significant performance degradation, but I haven't looked into it in detail yet. The mixed precision training runs still show a high idle time, but the slowdown for mirrored strategy isn't as significant as before. Here are the TensorBoard profiles for the measurements shown below:
@lgeiger Thanks very much for your information. Yes, your observation matches what we found. We have tracked down the performance issue to a tf.cond inside the Keras batch norm layer (which is there to handle empty batches). Setting drop_remainder=True avoids empty/partial batches and works around the problem. @rohan100jain is working on a fix and will give an update very soon. I think the reason we see more regression on FP16 is that the above problem is more significant when the workload has higher overhead in launching GPU kernels; fp16 makes some compute kernels more efficient and thus makes the workload more kernel-launch bound.
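For readers hitting the same issue, the workaround amounts to a one-argument change when batching the dataset; a minimal sketch (the dataset and batch size are placeholders):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(1000)

# Default batching keeps a final partial batch; its dynamic size is what
# triggers the tf.cond path inside the batch norm layer described above.
# dataset = dataset.batch(128)

# Workaround: drop the remainder so every batch has a full, static shape.
dataset = dataset.batch(128, drop_remainder=True)
```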
@zongweiz Thanks for looking into it. Looking forward to a fix.
@rohan100jain Do you have any updates on whether a fix will make it into the 2.4 release?
@rohan100jain @zongweiz @jaingaurav Sorry for pinging you again. Do you have any updates on whether the fix will make it into the 2.4 stable release?
Hey @lgeiger, I believe the fix did not make it into 2.4, unfortunately. @rohan100jain @zongweiz @goldiegadde please correct me if that is not the case.
Thanks for the response, that's really unfortunate since the regression has been there since 2.2. But at least we can now upgrade to 2.4 when enabling drop_remainder=True.
@guptapriya has there been any progress on this issue? It would be good if a fix made it into TF 2.5, since in our current workloads running on 4 GPUs we are seeing slowdowns of 30-80% due to this bug compared to TF 2.2 (using XLA makes the regression even more dramatic).
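For reference, a sketch of one way XLA can be enabled globally; the original workload's exact XLA configuration isn't shown in the thread:

```python
import tensorflow as tf

# Turn on XLA JIT compilation for eligible graphs.
tf.config.optimizer.set_jit(True)

# Roughly equivalent via the environment:
#   TF_XLA_FLAGS=--tf_xla_auto_jit=2 python train.py
```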
@lgeiger It looks like the last fix that was tried in November did not resolve the issue, and I don't see any other updates since then. Checking with @rohan100jain; will update when I know more.
Apologies, but I tried fixing it last year and that change had to be rolled back since it didn't do what we expected. We'll continue to work on it and get a fix out by 2.5.
@rohan100jain Thanks for the update. Let me know when a fix lands and I can rerun the benchmarks to verify.
I'm sorry, we looked into this issue and there isn't really any easy way of fixing it without rolling back a change (f0d0485) that enhances the dtype coverage of our GPU ops and improves the consistency of TensorFlow in general. This issue has exposed some problems with our device placement that we need to fix; we're planning to work on that and will have an RFC for it. I'll therefore recommend that you continue to use the drop_remainder=True workaround for now.
Thanks for the update. I will continue using the drop_remainder=True workaround for now. It would be awesome if it were possible to add this example (or a similar one using MirroredStrategy and a large cached dataset) to your internal regression-testing suite. In a lot of the TF version upgrades I have done in the past, I discovered some sort of memory issue or performance regression that was reproducible with code very similar to the example mentioned above (see #36240, #38617, #38655). It would be excellent if issues like that were caught automatically so they don't make it into the stable releases.
System information
Describe the current behavior
When upgrading from TensorFlow 2.2.0 to 2.3.0, we observed a 15-18% slowdown in training speed for our workloads. Unfortunately I wasn't able to find an easy-to-reproduce example before the stable release was cut, but below is a code example that illustrates the performance degradation.
When running the training script on a single NVIDIA V100, a 15% performance loss compared to 2.2 can be observed, which is still noticeable in the latest nightly:
[Profiler screenshot: "On Device: total self-time (grouped by type)"]
The example uses auto mixed precision, but the slowdown can also be observed when running in float32 or in multi-GPU training. When looking at the generated execution profile, the slowdown can be explained by an increased idle time of the GPU. Since the training data is cached in memory, there should be no IO bottleneck, so I am not sure whether this performance regression is caused by tf.data or by the runtime itself.
Describe the expected behavior
TensorFlow 2.3 should show equally fast training performance compared to 2.2.
Standalone code to reproduce the issue
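The original standalone snippet is not preserved in this thread. Below is a minimal sketch of the kind of benchmark described (synthetic cached in-memory dataset, Keras model, MirroredStrategy); the model choice, input shape, and batch size are illustrative assumptions, not the original gist:

```python
import tensorflow as tf

BATCH_SIZE = 128

def make_dataset():
    # Synthetic data cached in memory, so the GPU should never be input-bound.
    images = tf.random.uniform((BATCH_SIZE * 50, 32, 32, 3))
    labels = tf.random.uniform((BATCH_SIZE * 50,), maxval=10, dtype=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    return ds.cache().batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.applications.ResNet50(
        weights=None, input_shape=(32, 32, 3), classes=10)
    model.compile(optimizer=tf.keras.optimizers.SGD(0.01),
                  loss="sparse_categorical_crossentropy")

model.fit(make_dataset(), epochs=3)
```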
Other info / logs
TensorBoard profiles for the runs mentioned above are available at tb-profile.zip
@mihaimaruseac @jsimsa @guptapriya do you mind taking a look at this?