tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() #18862
Comments
Hi @matthewhelmi, in Keras 3 distribution strategies are not yet implemented. It seems you are using the tf.keras (Keras 2) version. Could you please confirm the TensorFlow version you tested the code with? IMO, this issue needs to be reported at the tf-keras repo.
Hi @SuryanarayanaY, thank you for the quick reply! The example I linked (from the TensorFlow docs) uses 2.14.0, which freezes with the model I showed above. My own code on an NVIDIA T4 machine (4 GPUs, 64 GB VRAM) uses 2.13.1 and also freezes. My own code on an NVIDIA A10G machine (4 GPUs, 96 GB VRAM) uses 2.13.0 and works. With your confirmation I'll happily move the issue to tf-keras.
We have not made any modifications to our handling of tf.distribute.
Actually, Keras 3 with the TF backend is fully compatible with tf.distribute. You could also try using Keras 3: it gives you access to alternative ways to distribute training (in particular via JAX and the keras.distribution API).
I have opened an issue on the TensorFlow repo: tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() #62571
Hi @matthewhelmi, since the issue needs to be addressed in the TF repo, where you have already raised a ticket, can you close it here and track it there? Thanks!
Will do! My apologies for forgetting.
I was preparing my own training script for multi-GPU support, to achieve higher batch sizes using tf.distribute.MirroredStrategy().
Things were working initially but began freezing (seemingly out of the blue) on an NVIDIA T4 machine (4 GPUs, 64 GB VRAM), while still running fine on an NVIDIA A10G machine (4 GPUs, 96 GB VRAM).
I used the following example to debug on the NVIDIA T4 machine (4 GPUs, 64 GB VRAM):
Custom training with tf.distribute.Strategy
I found that if I substitute their model (see the link) with a variant ending in GlobalAveragePooling2D, the training freezes from the start (just like my own code, which uses a similar model).
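The original snippet was not preserved in this page, but based on the description (a small CNN whose head is GlobalAveragePooling2D), a hypothetical stand-in would look roughly like this; the exact layer sizes are assumptions:

```python
# Hypothetical reconstruction of the substituted model: a small CNN that
# ends in GlobalAveragePooling2D, the layer reported to trigger the freeze.
import tensorflow as tf

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),  # reported trigger layer
        tf.keras.layers.Dense(10),
    ])
```

Any model of this shape should reproduce the symptom if the report is accurate; the specific filter counts and input size are illustrative only.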
My symptoms are similar to those in "Multi-GPU training not starting or freezing. GPUs at 100%".
The freezing stops if I replace GlobalAveragePooling2D() with Flatten().
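For context on that swap: the two layers compute different things and produce different shapes, so the downstream Dense layer also changes size. A quick numpy sketch of what each layer does (assuming NHWC tensors, as in Keras):

```python
import numpy as np

def global_average_pool(x):
    # GlobalAveragePooling2D: mean over the spatial axes (H, W),
    # leaving one value per channel.
    return x.mean(axis=(1, 2))

def flatten(x):
    # Flatten: collapse everything after the batch axis into one vector.
    return x.reshape(x.shape[0], -1)

x = np.ones((2, 4, 4, 8), dtype=np.float32)  # batch=2, 4x4 spatial, 8 channels
pooled = global_average_pool(x)  # shape (2, 8)
flat = flatten(x)                # shape (2, 128)
```

So the swap is not a drop-in replacement for accuracy purposes, but it is a useful isolation step: it points at the cross-replica reduction path rather than the model architecture in general.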
I have also noticed freezing when introducing stride > 1 in the conv layer.
Everything runs fine on a single GPU; the freezing only occurs when using more than one GPU.
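A minimal scaffold for the setup described above (a sketch, not the reporter's actual script; device selection and layer sizes are assumptions). With more than one visible GPU, MirroredStrategy replicates the model across all of them:

```python
# Minimal MirroredStrategy scaffold: model creation and compile must
# happen inside the strategy scope. Falls back to one replica on CPU-only
# machines, so it only exercises the multi-GPU path when GPUs are visible.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

Calling `model.fit` on this under 2+ replicas should be enough to check whether the hang reproduces on a given driver/TensorFlow combination.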