Multi-GPU training not starting or freezing. GPUs at 100% #54325
Comments
I have the same problem. I have checked the driver (ok), CUDA (ok), cuDNN (ok), and NCCL (ok); all of them operate normally. However, distributed training is broken and freezes. It is so disappointing.
Do you have any logs that I can look into?
Commenting just to say that I'm seeing the same behavior.

EDIT: Felt this might be useful: if I use `strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0'])` things work fine. If I switch it to more than one device, training hangs.

In fact, using more than one GPU even without MirroredStrategy runs into the same problem.

So I'm guessing (at least in my case) that there's something going on with multiple GPUs more generally, and MirroredStrategy just isn't catching the errors.
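For reference, a minimal sketch of the two configurations mentioned above (the device strings are the standard TensorFlow names; nothing else about the commenter's script is known):

```python
import tensorflow as tf

# Reportedly fine: mirror onto a single GPU only.
strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0'])

# Reportedly hangs: mirror across two or more GPUs.
# strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0', '/gpu:1'])
```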
After lots of headache and different attempts, I suspect it actually has to do with the CUDA architecture in use and the OS version / driver version / Linux kernel version. In my case, nothing helped to get multiple GPUs to work other than updating to a newer OS and Linux kernel. I guess for new(er) GPUs (as mentioned, I was using 3x RTX A6000, i.e. Ampere-generation NVIDIA GPUs) there is something that simply breaks on the older setup.

So TL;DR: any other attempt (different CUDA versions, driver versions, with or without Docker, with Anaconda, ...) did not work under the older OS / kernel; only the update did.
@jiayugedede @jurahul @lcorcodilos which OS + Linux kernel + CUDA + driver combination were you using? Could you test whether updating to a new(er) OS + Linux kernel resolves the issue for you, too?
This could be an NCCL problem. https://github.com/NVIDIA/nccl-tests Try running one of the nccl-tests binaries (e.g. `all_reduce_perf`) across all of your GPUs and you will get the same result.
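The exact command the commenter suggested was not preserved in this copy; nccl-tests itself is built and run from the linked repository. As a rough TensorFlow-level counterpart (a sketch assuming two or more visible GPUs), one can trigger a single cross-GPU all-reduce without Keras; if the NCCL collective itself is broken, this should hang in the same way:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

@tf.function
def step():
    def replica_fn():
        # all_reduce inside the replica context goes through the strategy's
        # cross-device ops (NCCL by default when multiple GPUs are mirrored).
        value = tf.ones([1024])
        return tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, value)
    return strategy.run(replica_fn)

per_replica = step()
# Inspect one local copy of the reduced tensor.
print(strategy.experimental_local_results(per_replica)[0][:4])
```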
Is there a solution to this? I am running into the same issue and all my packages are upgraded as suggested.
Assigning to @rohan100jain to find the right owner for this issue.
I had an issue similar to this, and I fixed it by disabling IOMMU in the BIOS. This is relevant if you have an AMD CPU and are using data-parallel training.
I got around the issue for now by modifying the cross-device ops that MirroredStrategy uses so that it no longer relies on NCCL.

I still think this is an issue with NCCL, because the default for that parameter is the NCCL-based implementation.
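The exact snippet was lost from this copy of the thread. Assuming the parameter referred to is `cross_device_ops` of `MirroredStrategy` (whose default is the NCCL-based all-reduce), a sketch of such a workaround would be:

```python
import tensorflow as tf

# Replace the default NCCL-based all-reduce with a non-NCCL implementation.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
# Alternative: cross_device_ops=tf.distribute.ReductionToOneDevice()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')
```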
The very same script runs without issues on 1 GPU (setting NVIDIA_VISIBLE_DEVICES to one GPU) but freezes with 2 or 3 GPUs.
System information

Describe the current behavior
The training script hangs when executing `model.fit()`. The terminal shows `Loaded cuDNN version 8201` thrice (once for each GPU) and displays "TensorFloat-32 will be used..." once. The GPUs go to 100% but nothing gets calculated, CPU usage is nearly 0%, "Epoch 1/50" is displayed by `model.fit()`, but no progress bar appears and nothing happens. The script is also unresponsive, even to Ctrl+C, so the process has to be killed manually.

Describe the expected behavior
After showing "Loaded cuDNN 8201", training should start and the progress bar of the `model.fit()` call should show up.

Contributing

Standalone code to reproduce the issue
My code is a more complex scenario, but even this simple code shows the same issue (it executes right away on 1 GPU but freezes with 2 or 3 GPUs):
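The standalone snippet itself was not preserved in this copy of the issue. A minimal sketch matching the description (MirroredStrategy over all visible GPUs, a small Keras model, synthetic data, 50 epochs) would be:

```python
import numpy as np
import tensorflow as tf

# Mirror across all visible GPUs (the configuration that freezes for the reporter).
strategy = tf.distribute.MirroredStrategy()
print('Number of devices:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Synthetic data: on a single GPU this trains immediately;
# with 2 or 3 GPUs it reportedly hangs right after "Epoch 1/50".
x = np.random.rand(1024, 32).astype('float32')
y = np.random.rand(1024, 1).astype('float32')
model.fit(x, y, batch_size=64, epochs=50)
```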