Multi-GPU training not starting or freezing. GPUs at 100% #54325

Daniel451 · 2022-02-10T12:24:30Z

Very same script runs without issues on 1 GPU (setting NVIDIA_VISIBLE_DEVICES to one GPU) but freezes with 2 or 3 GPUs.

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu Server 20.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
TensorFlow installed from (source or binary): 2.7 (Anaconda), 2.8 (official docker), 2.8 (self compiled docker)
TensorFlow version (use command below): 2.8 & 2,7
Python version: 3.9
Bazel version (if compiling from source): 4.2.1
GCC/Compiler version (if compiling from source): 9
CUDA/cuDNN version: 11.4
GPU model and memory: 3x RTX A6000 / 3x 48GB

Describe the current behavior
Training script hangs at executing model.fit(). Terminal shows Loaded cuDNN version 8201 thrice (once for each GPU) and displays "TensorFloat-32 will be used..." once. GPUs turn to 100% but nothing gets calculated, CPU usage is nearly 0%, "Epoch 1/50" is displayed by model.fit() but no progress bar and nothing happens. Script is also unresponsive, even to Ctrl+C => process has to be killed manually.

Describe the expected behavior
After showing "Loaded cuDNN 8201" training should start and progress bar of model.fit() call should show up.

Contributing

Do you want to contribute a PR? (yes/no): no
Briefly describe your candidate solution(if contributing): -

Standalone code to reproduce the issue
My code is a more complex scenario but even this simple code here shows the same issue (executes right away on 1 GPU but freezes with 2 or 3 GPUs):

import tensorflow as tf
import numpy as np
 
 
X = np.random.random((1000, 128, 128, 3)).astype(np.float32)
Y = np.random.random((1000, 10)).astype(np.float32)
 
dat_x = tf.data.Dataset.from_tensor_slices(X)
dat_y = tf.data.Dataset.from_tensor_slices(Y)
ds = tf.data.Dataset.zip((dat_x, dat_y))
ds = ds.batch(96)
ds = ds.repeat(50)
 
strategy = tf.distribute.MirroredStrategy()
 
with strategy.scope():
    inputs = tf.keras.Input(shape=(128, 128, 3))
    x = tf.keras.layers.Conv2D(32, (3, 3))(inputs)
    x = tf.keras.layers.MaxPool2D()(x)
    x = tf.keras.layers.Conv2D(32, (3, 3))(x)
    x = tf.keras.layers.MaxPool2D()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(10)(x)
    model = tf.keras.Model(inputs=inputs, outputs=x)
    optim = tf.optimizers.Adam()
 
    model.compile(optim, loss=tf.keras.losses.CategoricalCrossentropy())
 
 
model.fit(ds, epochs=50)

The text was updated successfully, but these errors were encountered:

jiayugedede · 2022-03-24T08:07:41Z

I have the same problems, I have checked the driver(ok), cuda(ok), cudnn(ok), nccl(ok), all of these were normal operation. However, the distribution training was broken and frozen. It is so disappointing.

jurahul · 2022-03-24T16:38:37Z

Do you have any logs that I can look into?

lcorcodilos · 2022-04-13T03:11:14Z

Commenting just to say that I'm seeing the same behavior though with cuDNN version 8400

EDIT: Felt this might be useful - If I use

strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0'])

things work fine. If I switch it to /gpu:1, I get the following errors:

...
Node: 'Assert/Assert'
3 root error(s) found.
  (0) INVALID_ARGUMENT:  assertion failed: [loop must iterate at least once to initialize outputs]
         [[{{node Assert/Assert}}]]
         [[while/body/_1/while/div_no_nan_1/ReadVariableOp/_95]]
  (1) INVALID_ARGUMENT:  assertion failed: [loop must iterate at least once to initialize outputs]
         [[{{node Assert/Assert}}]]
         [[while/body/_1/gradient_tape/while/sequential/global_average_pooling2d/ones/_86]]
  (2) INVALID_ARGUMENT:  assertion failed: [loop must iterate at least once to initialize outputs]
         [[{{node Assert/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_5209]

In fact, using with tf.device('/gpu:1'): instead of with strategy leads to the same error. However, I can remove the error by adding

os.environ["CUDA_VISIBLE_DEVICES"]="1"
...
with tf.device('/GPU:1'):

So I'm guessing (at least in my case), that there's something going on with multiple GPUs more generally and MirroredStrategy just isn't catching the errors.

Daniel451 · 2022-09-26T22:47:46Z

After lots of headache and different attempts I am suspecting that it actually has to do with the cuda architecture in use and the OS version / driver version / linux kernel version.

In my case, nothing helped to get multiple GPUs to work other than updating to Ubuntu 22.04, ensuring that a new linux kernel 5.15 is running and having CUDA >= 11.3 + updating to latest driver.

I guess for new(er) GPUs, as mentioned I was using 3x RTX A6000, thus Ampere generation NVIDIA GPUs, there is something that simply breaks with Ubuntu 20.04 and/or the old linux kernel available in aforementioned Ubuntu. Even an Ubuntu 22.04 in a docker environment was not helping.

So TL;DR:
For me updating to a more modern Ubuntu, updating linux kernel and then using CUDA >= 11.3 and latest nvidia driver worked.

Any other attempt, like different CUDA versions, driver versions, with or without docker, with Anaconda, ... did not work under Ubuntu 20.04 and older linux kernel with new NVIDIA Ampere cards.

Daniel451 · 2022-09-26T22:49:32Z

@jiayugedede @jurahul @lcorcodilos which OS + linux kernel + CUDA + driver combination where you using? Could you test if updating to new(er) OS + linux kernel resolves the issue for you, too?

jayagami · 2023-01-10T08:16:04Z

This could be a nccl problem.

https://github.com/NVIDIA/nccl-tests

Try

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

and you will get the same result

aminfardi-CD · 2023-09-12T15:55:17Z

Is there a solution to this? Running into the same issue and all my packages are upgraded as suggested.

jurahul · 2023-09-12T16:53:42Z

Assigning to @rohan100jain to find the right owner for this issue.

WissamAntoun · 2023-09-13T09:40:52Z

I had an issue similar to this, and i fixed it by disabling IOMMU in the bios. This is relevant if you have an AMD CPU and using Data Parallel

aminfardi-CD · 2023-09-13T15:11:19Z

I got around the issue for now by modifying the cross_device_ops to tf.distribute.HierarchicalCopyAllReduce() as per stackoverflow . Full code snippet:

strategy = tf.distribute.MirroredStrategy( cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

I still think this is an issue with the nccl, because the default argument for that parameter is tf.distribute.NcclAllReduce

Daniel451 added the type:bug Bug label Feb 10, 2022

google-ml-butler bot assigned mohantym Feb 10, 2022

mohantym added TF 2.8 comp:gpu GPU related issues labels Feb 10, 2022

mohantym assigned sachinprasadhs and unassigned mohantym Feb 10, 2022

Daniel451 changed the title ~~Multi-GPU training not starting. GPUs at 100%~~ Multi-GPU training not starting or freezing. GPUs at 100% Feb 10, 2022

sachinprasadhs assigned jurahul Feb 11, 2022

sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 11, 2022

jurahul assigned rohan100jain and Daniel451 and unassigned rohan100jain, jurahul, Daniel451 and sachinprasadhs Sep 12, 2023

This was referenced Dec 1, 2023

tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() keras-team/keras#18862

Closed

tf.keras.layers.GlobalAveragePooling2D() freezes tf.distribute.MirroredStrategy() #62571

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-GPU training not starting or freezing. GPUs at 100% #54325

Multi-GPU training not starting or freezing. GPUs at 100% #54325

Multi-GPU training not starting or freezing. GPUs at 100% #54325

Multi-GPU training not starting or freezing. GPUs at 100% #54325

Comments