
tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() #18862

Closed
matthewhelmi opened this issue Dec 1, 2023 · 7 comments
Labels: backend:tensorflow, stat:awaiting response from contributor, type:support

matthewhelmi commented Dec 1, 2023

I was preparing my own training script for multi-GPU support, using tf.distribute.MirroredStrategy() to reach larger batch sizes.

Things worked initially but then began freezing (seemingly out of the blue) on NVIDIA T4 (4 GPUs, 64 GB VRAM), while the script still runs fine on NVIDIA A10G (4 GPUs, 96 GB VRAM).

I used the following example to debug on the NVIDIA T4 machine (4 GPUs, 64 GB VRAM):

Custom training with tf.distribute.Strategy

I found that if I substitute their model (see link) with

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="relu")
])

the training freezes from the start (just like my own code, which is using a similar model).

My symptoms are similar to those reported in "Multi-GPU training not starting or freezing. GPUs at 100%".

It stops freezing if I replace GlobalAveragePooling2D() with Flatten():

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="relu")
])
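For context on what this swap changes, here is a small NumPy sketch of the two layers' effect on tensor shapes. It is only an illustration: the 28x28 single-channel input (and therefore the 26x26x10 feature map) is an assumption borrowed from the Fashion-MNIST-style data in the linked tutorial, and the arrays stand in for real Keras tensors.

```python
import numpy as np

# Hypothetical feature map: batch of 2, 26x26 spatial grid, 10 channels.
# This is the shape a 3x3 Conv2D with 10 filters and default ("valid")
# padding would produce for an assumed 28x28 input.
features = np.arange(2 * 26 * 26 * 10, dtype=np.float32).reshape(2, 26, 26, 10)

# GlobalAveragePooling2D: mean over both spatial axes -> (batch, channels)
gap_out = features.mean(axis=(1, 2))

# Flatten: keep every value, only reshape -> (batch, height * width * channels)
flat_out = features.reshape(features.shape[0], -1)

print(gap_out.shape)   # (2, 10)
print(flat_out.shape)  # (2, 6760)
```

The two models are therefore not equivalent: the Dense layer receives 10 features per example after pooling and 6760 after flattening, so the swap changes both the reduction op and the size of the gradients that are synchronized across GPUs. This does not identify the cause of the freeze; it only shows what differs between the two variants.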

I have also noticed freezing when introducing stride > 1 in the conv layer.
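As a rough guide to what a stride > 1 changes, the sketch below computes the spatial output size of a convolution with no padding (the Conv2D default, "valid"). The helper name `conv_output_size` and the 28-pixel input are assumptions for illustration, not part of the original script.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int = 1) -> int:
    """Spatial output size of a convolution with 'valid' (no) padding."""
    return (input_size - kernel_size) // stride + 1


# Assumed 28-pixel input side with the 3x3 kernel from the model above.
print(conv_output_size(28, 3, stride=1))  # 26
print(conv_output_size(28, 3, stride=2))  # 13
print(conv_output_size(28, 3, stride=3))  # 9
```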

Everything runs fine on a single GPU. Freezing only occurs when using more than one GPU.
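For anyone trying to reproduce this, the following is a minimal sketch of the setup described above, modeled on the custom training loop in the linked tutorial. Several things are assumptions rather than details from the original script: the random 28x28x1 data, the batch size, the SGD optimizer, the explicit Input layer, and the helper names `build_model` and `run_repro`. It targets tf.keras as shipped with TensorFlow 2.13/2.14. On a machine affected by the problem, the GlobalAveragePooling2D variant would be expected to hang rather than print losses.

```python
import numpy as np
import tensorflow as tf

IMG_SHAPE = (28, 28, 1)   # assumed input size (Fashion-MNIST-like)
PER_REPLICA_BATCH = 8     # assumed per-GPU batch size


def build_model(use_gap: bool) -> tf.keras.Model:
    """Model from the report; use_gap=False swaps in Flatten()."""
    pool = (tf.keras.layers.GlobalAveragePooling2D() if use_gap
            else tf.keras.layers.Flatten())
    return tf.keras.Sequential([
        tf.keras.Input(shape=IMG_SHAPE),  # added so variables are built in scope
        tf.keras.layers.Conv2D(10, kernel_size=3, activation="relu"),
        pool,
        tf.keras.layers.Dense(10, activation="relu"),
    ])


def run_repro(use_gap: bool, steps: int = 2) -> list:
    """Run a few distributed training steps and return the losses."""
    # Uses every visible GPU; with no GPU it runs as a single CPU replica.
    strategy = tf.distribute.MirroredStrategy()
    global_batch = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

    # Random stand-in data, one global batch per step.
    rng = np.random.default_rng(0)
    images = rng.random((global_batch * steps, *IMG_SHAPE), dtype=np.float32)
    labels = rng.integers(0, 10, size=global_batch * steps)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(global_batch)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    with strategy.scope():
        model = build_model(use_gap)
        optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction="none")

    def train_step(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            per_example = loss_fn(y, model(x, training=True))
            # Scale by the global batch so the summed replica losses average out.
            loss = tf.nn.compute_average_loss(
                per_example, global_batch_size=global_batch)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    @tf.function
    def distributed_step(inputs):
        per_replica_loss = strategy.run(train_step, args=(inputs,))
        return strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

    losses = []
    for step, batch in enumerate(dist_dataset):
        losses.append(float(distributed_step(batch)))
        print(f"use_gap={use_gap} step={step} loss={losses[-1]:.4f}", flush=True)
    return losses
```

Calling `run_repro(use_gap=True)` and then `run_repro(use_gap=False)` on the same machine gives a side-by-side comparison of the two variants; the flushed per-step print makes it easier to see whether the first step ever completes.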

@SuryanarayanaY SuryanarayanaY added type:support User is asking for help / asking an implementation question. Stackoverflow would be better suited. backend:tensorflow labels Dec 1, 2023
@SuryanarayanaY
Contributor

Hi @matthewhelmi ,

In Keras 3, distribution strategies are not yet implemented. It seems you are using the tf.keras (Keras 2) version. Could you please confirm the TensorFlow version you tested the code with? IMO, this issue needs to be reported at the tf-keras repo.

@matthewhelmi
Author

Hi @SuryanarayanaY ,

Thank you for the quick reply!

The example I linked uses 2.14.0 (from the TensorFlow docs) and freezes with the model I showed above.

My own code on NVIDIA T4 (4 GPUs, 64 GB VRAM) uses 2.13.1 and also freezes.

My own code on NVIDIA A10G (4 GPUs, 96 GB VRAM) uses 2.13.0 and works.

With your confirmation I'll happily move the issue to tf-keras.

@fchollet
Member
fchollet commented Dec 1, 2023

We have not made any modification to our handling of tf.distribute in Keras 2 from 2.13 to 2.14, and neither to GlobalAveragePooling2D. The issue must be a bug with tf.distribute. I would recommend opening an issue on the TensorFlow repo.

"In Keras 3, distribution strategies are not yet implemented."

Actually, Keras 3 with the TF backend is fully compatible with tf.distribute (except for ParameterServerStrategy).

You could also try Keras 3; it gives you access to alternative ways to distribute training (in particular via JAX and keras.distribution).

@matthewhelmi
Author

@SuryanarayanaY
Contributor

Hi @matthewhelmi, since the issue needs to be addressed in the TF repo, where you have already raised a ticket, could you close it here and track it there?

Thanks!

@matthewhelmi
Author

Will do! My apologies for forgetting.
