tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() #18862

matthewhelmi · 2023-12-01T08:30:50Z

I was preparing my own training script for multi-GPU support, using tf.distribute.MirroredStrategy() to achieve higher batch sizes.

Training worked initially but then began freezing (seemingly out of the blue) on NVIDIA T4 (4 GPUs, 64 GB VRAM in total). It still runs fine on NVIDIA A10G (4 GPUs, 96 GB VRAM in total).

I used the following example to debug on the NVIDIA T4 machine (4 GPUs, 64 GB VRAM in total):

Custom training with tf.distribute.Strategy

I found that if I substitute their model (see link) with

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="relu"),
])

the training freezes from the start (just like my own code, which is using a similar model).
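For anyone trying to reproduce this, the setup described above can be sketched roughly as follows. This is not the original script: the synthetic 28x28 grayscale data, the batch size, the optimizer, and the helper names (`train_step`, `distributed_train_step`) are assumptions chosen to mirror the general shape of the linked tutorial. On a machine with no GPUs it should simply run on a single device.

```python
import numpy as np
import tensorflow as tf

# Uses all visible GPUs; falls back to one device when none are present.
strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH = 16 * strategy.num_replicas_in_sync

# Synthetic stand-in for the tutorial's image data (assumption).
num_examples = GLOBAL_BATCH * 4
images = np.random.rand(num_examples, 28, 28, 1).astype("float32")
labels = np.random.randint(0, 10, size=(num_examples,)).astype("int64")
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    # The model from the report, created under the strategy scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="relu"),
    ])
    model.build((None, 28, 28, 1))
    optimizer = tf.keras.optimizers.SGD(0.01)
    # Per-example losses; averaging over the global batch is done manually.
    loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="none")

def train_step(inputs):
    x, y = inputs
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        per_example = loss_obj(y, logits)
        loss = tf.nn.compute_average_loss(
            per_example, global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

# If the hang reproduces, the first step is where it would be expected to stall.
losses = [float(distributed_train_step(batch)) for batch in dist_dataset]
print("replicas:", strategy.num_replicas_in_sync, "steps completed:", len(losses))
```

If the script prints the step count on a single GPU but never gets there with several GPUs visible, that would match the behavior described in this report.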

My symptoms are similar to Multi-GPU training not starting or freezing. GPUs at 100%

It stops freezing if I replace GlobalAveragePooling2D() with a Flatten().

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="relu"),
])
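To make the difference between the two variants concrete, here is a small sketch of what each layer does to the convolution output. The 28x28x1 input shape is an assumption taken from the kind of data the tutorial uses. It only illustrates the shapes and the reduction involved and does not by itself explain the freeze.

```python
import tensorflow as tf

# Assumed input: a batch of four 28x28 single-channel images.
x = tf.random.normal((4, 28, 28, 1))

conv = tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu")
features = conv(x)  # "valid" padding: 28 - 3 + 1 = 26 per spatial dim

# GlobalAveragePooling2D takes the mean over height and width.
pooled = tf.keras.layers.GlobalAveragePooling2D()(features)
# Flatten only reshapes; no reduction op is involved.
flat = tf.keras.layers.Flatten()(features)
# Equivalent manual reduction over the spatial axes.
manual = tf.reduce_mean(features, axis=[1, 2])

print("conv output:", features.shape)
print("global average pooling:", pooled.shape)
print("flatten:", flat.shape)
```

So the swap replaces a mean reduction over the spatial axes with a plain reshape, and the Dense layer that follows sees 26 * 26 * 10 = 6760 inputs instead of 10.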

I have also noticed freezing when introducing stride > 1 in the conv layer.

Everything runs fine on a single GPU. Freezing only occurs when using more than one GPU.


SuryanarayanaY · 2023-12-01T09:40:05Z

Hi @matthewhelmi ,

In Keras 3, distribution strategies are not yet implemented. It seems you are using tf.keras (Keras 2). Could you please confirm the TensorFlow version you tested the code with? IMO, this issue should be reported at the tf-keras repo.

matthewhelmi · 2023-12-01T10:07:20Z

Hi @SuryanarayanaY ,

Thank you for the quick reply!

The example I linked is using 2.14.0 (from the TensorFlow docs), and it freezes with the model I showed above.

My own code on NVIDIA T4 (4 GPU) - VRAM 64gb is using 2.13.1 and also freezes.

My own code on NVIDIA A10G (4 GPU) - VRAM 96gb is using 2.13.0 and works.

With your confirmation I'll happily move the issue to tf-keras.

fchollet · 2023-12-01T17:22:35Z

We have not made any modification to our handling of tf.distribute in Keras 2 from 2.13 to 2.14, nor to GlobalAveragePooling2D. The issue must be a bug with tf.distribute. I would recommend opening an issue on the TensorFlow repo.

"In Keras 3, distribution strategies are not yet implemented."

Actually Keras 3 with the TF backend is fully compatible with tf.distribute (except for ParameterServerStrategy).

You could also try using Keras 3, which gives you access to alternative ways to distribute (in particular via JAX and keras.distribution).

matthewhelmi · 2023-12-05T08:54:23Z

I have opened an issue on TensorFlow.

tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() #62571

SuryanarayanaY · 2024-01-31T05:40:47Z

Hi @matthewhelmi , since the issue needs to be addressed in the TF repo, where you have already raised a ticket, can you close it here and track it there?

Thanks!

matthewhelmi · 2024-01-31T11:38:22Z

Will do! My apologies for forgetting.


github-actions bot assigned SuryanarayanaY Dec 1, 2023

SuryanarayanaY added the type:support and backend:tensorflow labels Dec 1, 2023

SuryanarayanaY added the stat:awaiting response from contributor label Dec 1, 2023

google-ml-butler bot removed the stat:awaiting response from contributor label Dec 1, 2023

Venkat6871 mentioned this issue Dec 8, 2023

tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() tensorflow/tensorflow#62571

Open

SuryanarayanaY added the stat:awaiting response from contributor label Jan 31, 2024

matthewhelmi closed this as completed Jan 31, 2024
