[2.2] XLA requires 2x GPU memory with sparse_categorical_crossentropy #38675
Comments
@lgeiger Thanks for looking into this! We'll track this bug, but in general we have found that it is very difficult to deal with such problems when using autoclustering, so we try to use explicit compilation scopes instead.
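For reference, a minimal sketch of what an explicit compilation scope looks like in TF 2.2, assuming `tf.function(experimental_compile=True)` is the mechanism being referred to (the training-step code is purely illustrative):

```python
import tensorflow as tf

# Explicit XLA compilation scope (TF 2.2): only this function is compiled,
# instead of letting autoclustering (--tf_xla_auto_jit=2) pick clusters globally.
@tf.function(experimental_compile=True)
def train_step(model, optimizer, images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, logits, from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```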
@cheshire Thanks for taking a look, appreciate your help! I tried to narrow the problem down a bit, but unfortunately after more testing it seems the issue still occurs even when removing …. I'll try to investigate further, but it's tricky to narrow it down.
For now it looks like the only reliable way for me to get this working is to either disable XLA autoclustering or to not use distributed training.
Hi @lgeiger! Here are my pointers on this issue: it seems you have not followed the XLA syntax. I've attached a gist with the XLA syntax for reference. Thank you!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
System information
Describe the current behavior
I am currently trying to upgrade from TensorFlow 2.0.0 to 2.2.0rc3 and noticed a regression related to model training using XLA in a multi-GPU environment (GCP VM with 4 NVIDIA V100s).
The training code linked below runs comfortably with a huge batch size of 3072 on 4 GPUs using the normal TensorFlow runtime. However, when enabling XLA with `TF_XLA_FLAGS=--tf_xla_auto_jit=2`, the same code runs out of GPU memory on the first batch of data. With XLA I can only use a maximum batch size of 1536 (50%) to prevent the code from running out of memory, which doesn't seem right.
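For context, autoclustering can also be enabled programmatically rather than via the environment variable; a minimal sketch (roughly equivalent to the flag above, as far as I know):

```python
import tensorflow as tf

# Roughly equivalent to running with TF_XLA_FLAGS=--tf_xla_auto_jit=2:
# turn on XLA autoclustering for tf.function graphs in this process.
tf.config.optimizer.set_jit(True)
```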
In which cases are the memory requirements of XLA and the default runtime similar?

To narrow down the possible causes for this I found a few cases where the maximum batch size for XLA and the normal runtime are the same:

- TensorFlow 2.0.0 doesn't seem to show this issue.
- Removing `.prefetch(1)` from the data pipeline fixes the issue.
- Changing the training to one-hot encoded labels seems to fix the increased XLA memory requirements as well. To test this I changed the preprocessing and the loss (a sketch of the change is shown after this list).
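The original snippets for that change aren't preserved in this copy of the issue; as a rough sketch, assuming a tf.data preprocessing function and a Keras loss, the switch might look like this (function names and NUM_CLASSES are illustrative):

```python
import tensorflow as tf

NUM_CLASSES = 1000  # illustrative; depends on the dataset

# Before: integer labels with sparse categorical cross entropy.
def preprocess_sparse(image, label):
    return image, tf.cast(label, tf.int32)

sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# After: one-hot encoded labels with (dense) categorical cross entropy,
# which avoided the extra XLA memory usage described above.
def preprocess_one_hot(image, label):
    return image, tf.one_hot(label, NUM_CLASSES)

dense_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
```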
The above conditions suggest that the prefetched `int32` labels and sparse categorical cross entropy might cause the regression with XLA, though I might be missing something here. Any help would be very much appreciated.

Describe the expected behavior
GPU memory requirements (measured here by maximum usable batch size) should be similar between XLA and the default runtime.
Standalone code to reproduce the issue
Other info / logs
I attached the XLA dumps below:
xla_dumps.tar.gz