Multi-GPU training not starting or freezing. GPUs at 100% #54325

Open

Daniel451 opened this issue Feb 10, 2022 · 10 comments
Labels: comp:gpu (GPU related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.8, type:bug (Bug)

Comments

@Daniel451

Very same script runs without issues on 1 GPU (setting NVIDIA_VISIBLE_DEVICES to one GPU) but freezes with 2 or 3 GPUs.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu Server 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): 2.7 (Anaconda), 2.8 (official Docker), 2.8 (self-compiled Docker)
  • TensorFlow version (use command below): 2.8 & 2.7
  • Python version: 3.9
  • Bazel version (if compiling from source): 4.2.1
  • GCC/Compiler version (if compiling from source): 9
  • CUDA/cuDNN version: 11.4
  • GPU model and memory: 3x RTX A6000 / 3x 48GB

Describe the current behavior
The training script hangs when executing model.fit(). The terminal shows "Loaded cuDNN version 8201" three times (once for each GPU) and "TensorFloat-32 will be used..." once. GPU utilization jumps to 100% but nothing is computed, CPU usage is nearly 0%, and model.fit() prints "Epoch 1/50" but no progress bar ever appears. The script is also unresponsive, even to Ctrl+C, so the process has to be killed manually.

Describe the expected behavior
After "Loaded cuDNN version 8201" is shown, training should start and the model.fit() progress bar should appear.

Contributing

  • Do you want to contribute a PR? (yes/no): no
  • Briefly describe your candidate solution(if contributing): -

Standalone code to reproduce the issue
My actual code is a more complex scenario, but even the simple script below shows the same issue (it executes right away on 1 GPU but freezes with 2 or 3 GPUs):

import tensorflow as tf
import numpy as np
 
 
X = np.random.random((1000, 128, 128, 3)).astype(np.float32)
Y = np.random.random((1000, 10)).astype(np.float32)
 
dat_x = tf.data.Dataset.from_tensor_slices(X)
dat_y = tf.data.Dataset.from_tensor_slices(Y)
ds = tf.data.Dataset.zip((dat_x, dat_y))
ds = ds.batch(96)
ds = ds.repeat(50)
 
strategy = tf.distribute.MirroredStrategy()
 
with strategy.scope():
    inputs = tf.keras.Input(shape=(128, 128, 3))
    x = tf.keras.layers.Conv2D(32, (3, 3))(inputs)
    x = tf.keras.layers.MaxPool2D()(x)
    x = tf.keras.layers.Conv2D(32, (3, 3))(x)
    x = tf.keras.layers.MaxPool2D()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(10)(x)
    model = tf.keras.Model(inputs=inputs, outputs=x)
    optim = tf.optimizers.Adam()
 
    model.compile(optim, loss=tf.keras.losses.CategoricalCrossentropy())
 
 
model.fit(ds, epochs=50)

@Daniel451 Daniel451 added the type:bug Bug label Feb 10, 2022
@mohantym mohantym added TF 2.8 comp:gpu GPU related issues labels Feb 10, 2022
@mohantym mohantym assigned sachinprasadhs and unassigned mohantym Feb 10, 2022
@Daniel451 Daniel451 changed the title Multi-GPU training not starting. GPUs at 100% Multi-GPU training not starting or freezing. GPUs at 100% Feb 10, 2022
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 11, 2022
@jiayugedede

I have the same problem. I have checked the driver (ok), CUDA (ok), cuDNN (ok), and NCCL (ok); all of these were operating normally. However, distributed training was still broken and froze. It is so disappointing.

@jurahul
Contributor
jurahul commented Mar 24, 2022

Do you have any logs that I can look into?

@lcorcodilos
lcorcodilos commented Apr 13, 2022

Commenting just to say that I'm seeing the same behavior, though with cuDNN version 8400.

EDIT: Felt this might be useful: if I use

strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0'])

things work fine. If I switch it to /gpu:1, I get the following errors:

...
Node: 'Assert/Assert'
3 root error(s) found.
  (0) INVALID_ARGUMENT:  assertion failed: [loop must iterate at least once to initialize outputs]
         [[{{node Assert/Assert}}]]
         [[while/body/_1/while/div_no_nan_1/ReadVariableOp/_95]]
  (1) INVALID_ARGUMENT:  assertion failed: [loop must iterate at least once to initialize outputs]
         [[{{node Assert/Assert}}]]
         [[while/body/_1/gradient_tape/while/sequential/global_average_pooling2d/ones/_86]]
  (2) INVALID_ARGUMENT:  assertion failed: [loop must iterate at least once to initialize outputs]
         [[{{node Assert/Assert}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_5209]

In fact, using with tf.device('/gpu:1'): instead of the strategy leads to the same error. However, I can remove the error by adding

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
...
with tf.device('/GPU:1'):

So I'm guessing (at least in my case) that there's something going on with multiple GPUs more generally, and MirroredStrategy just isn't surfacing the errors.
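
For anyone else trying to narrow this down, a minimal sketch to probe a single card in isolation (the device index and the tiny convolution below are just illustrative; the environment variable has to be set before TensorFlow is imported):

import os

# Restrict visibility to a single physical GPU *before* importing TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf

# Whatever was selected above is enumerated as GPU:0 inside TensorFlow.
print(tf.config.list_physical_devices("GPU"))

with tf.device("/GPU:0"):
    x = tf.random.normal((8, 128, 128, 3))
    y = tf.keras.layers.Conv2D(32, 3)(x)  # trivial op to confirm the card works
    print(y.shape)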

@Daniel451
Author

After a lot of headache and many different attempts, I suspect it actually has to do with the CUDA architecture in use and the OS version / driver version / Linux kernel version.

In my case, nothing got multiple GPUs to work other than updating to Ubuntu 22.04, making sure a recent Linux kernel (5.15) is running, using CUDA >= 11.3, and updating to the latest driver.

I guess that for newer GPUs (as mentioned, I was using 3x RTX A6000, i.e. Ampere-generation NVIDIA GPUs) something simply breaks with Ubuntu 20.04 and/or the old Linux kernel shipped with it. Even running Ubuntu 22.04 inside a Docker environment did not help.

So TL;DR:
For me, updating to a more modern Ubuntu, updating the Linux kernel, and then using CUDA >= 11.3 with the latest NVIDIA driver worked.

Any other attempt, like different CUDA versions, driver versions, with or without Docker, with Anaconda, ... did not work under Ubuntu 20.04 and its older Linux kernel with the new NVIDIA Ampere cards.
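
For anyone comparing setups, here is a quick sketch to print the relevant pieces from inside Python (the driver version itself still has to be read from nvidia-smi, which is not shown here):

import platform
import tensorflow as tf

build = tf.sysconfig.get_build_info()

print("Linux kernel:", platform.release())
print("TensorFlow:", tf.__version__)
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))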

@Daniel451
Author

@jiayugedede @jurahul @lcorcodilos which OS + Linux kernel + CUDA + driver combination were you using? Could you test whether updating to a newer OS + Linux kernel resolves the issue for you, too?

@jayagami

This could be an NCCL problem.

https://github.com/NVIDIA/nccl-tests

Try

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

and you will get the same result
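
If building nccl-tests is not convenient, here is a rough in-TensorFlow sketch of the same idea (a sketch, not a substitute for nccl-tests; as far as I know, MirroredStrategy defaults to NcclAllReduce on GPUs, so an explicit all_reduce inside a tf.function should exercise the same NCCL path the training step hangs in):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # defaults to NcclAllReduce on GPUs

with strategy.scope():
    w = tf.Variable(tf.random.normal((256, 256)))

@tf.function
def probe_step():
    def replica_fn():
        with tf.GradientTape() as tape:
            loss = tf.reduce_sum(tf.matmul(w, w))
        grad = tape.gradient(loss, w)
        # Explicit cross-replica all-reduce of the local gradient.
        return tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, grad)
    return strategy.run(replica_fn)

print(probe_step())  # if this hangs with 2+ GPUs, the problem is below Keras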

@aminfardi-CD

Is there a solution to this? I'm running into the same issue, and all my packages are upgraded as suggested.

@jurahul
Contributor
jurahul commented Sep 12, 2023

Assigning to @rohan100jain to find the right owner for this issue.

@WissamAntoun

I had an issue similar to this, and I fixed it by disabling IOMMU in the BIOS. This is relevant if you have an AMD CPU and are using data-parallel training.
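
If you want to check the current state before going into the BIOS, a rough sketch for Linux (an empty directory usually means no IOMMU is active; the fix itself is still a BIOS setting):

import os

# IOMMU devices show up under /sys/class/iommu when an IOMMU is enabled.
print(os.listdir("/sys/class/iommu"))

# Also worth a look: explicit iommu=/amd_iommu= flags on the kernel command line.
print(open("/proc/cmdline").read())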

@aminfardi-CD

I got around the issue for now by changing cross_device_ops to tf.distribute.HierarchicalCopyAllReduce(), as suggested on Stack Overflow. Full code snippet:

strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

I still think this is an issue with NCCL, because the default value of that parameter is tf.distribute.NcclAllReduce.
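
For reference, only the strategy construction in the repro script above needs to change; the commented line shows tf.distribute.ReductionToOneDevice, which I believe is another option that avoids NCCL:

# Route cross-device reductions through hierarchical device-to-device copies
# instead of NCCL; everything below the strategy creation stays as in the repro.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

# Alternative that also avoids NCCL (reduces on a single device):
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())

with strategy.scope():
    ...  # build and compile the model exactly as in the original script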
