

Introduce ability to clear GPU memory in Tensorflow 2 #48545

Open
GatGit12 opened this issue Apr 15, 2021 · 46 comments
Labels: comp:gpu (GPU related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:feature (Feature requests)

Comments

@GatGit12


System information

  • TensorFlow version (you are using): 2.3.1, 2.4.1
  • Are you willing to contribute it (Yes/No): No

Describe the feature and the current behavior/state.
Currently there is no way to completely free GPU RAM once it has been allocated.
For example, I want to use TensorFlow in the context of 3D visualization, which is made next to impossible by this behavior. Standard solutions like tf.config.experimental.set_memory_growth(gpus[0], True) are unfortunately not sufficient, because RAM that has been allocated once cannot be released again.
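For context, a minimal sketch of that standard memory-growth approach; it only prevents TensorFlow from reserving all GPU RAM up front and does not return memory that has already been allocated:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Must be called before any GPU op initializes the device.
    tf.config.experimental.set_memory_growth(gpus[0], True)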

In #36465 (#36465 (comment)), it is mentioned that an option to release GPU memory already exists via GPUProcessState::TestOnlyReset and ProcessState::TestOnlyReset, but it is not exposed and is intended for testing purposes only.

It would be very nice for applications using TensorFlow to have proper access to GPU RAM release functions.

Will this change the current API? How?
Introduce a new (experimental) function to reset the current session/graph/device state and thus free the GPU RAM completely.
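A purely hypothetical sketch of how such a function might be used; the release call below is an invented name for illustration only and does not exist in TensorFlow today:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(8)])
model(tf.zeros([1, 4]))        # first GPU op, memory gets allocated

# ... now the GPU is needed for 3D rendering ...
del model

# Hypothetical experimental call that would tear down the TF device/runtime
# state and hand all allocated GPU RAM back to the driver:
# tf.config.experimental.release_gpu_memory()   # does NOT exist today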

Who will benefit from this feature?
People who use TensorFlow in their applications in conjunction with other GPU-RAM-critical operations such as 3D rendering.

GatGit12 added the type:feature label on Apr 15, 2021
UsharaniPagadala added the comp:gpu and TF 2.4 labels on Apr 15, 2021
@amahendrakar

@GatGit12,
Since a similar issue is already being tracked in issue #36465, to avoid duplicates could you please close this one and subscribe to/follow that issue? Thanks!

amahendrakar added the stat:awaiting response label on Apr 15, 2021
@GatGit12

@GatGit12,
Since a similar issue is already being tracked in issue #36465, to avoid duplicates could you please close this one and subscribe to/follow that issue? Thanks!

But that issue (#36465) is marked as a bug, has received no attention, and contains no explicit feature request, which is why I formally opened this post as a feature request.

Also, there are several issues about GPU RAM clearing which are simply being ignored...

Here is a selection (without guarantee of completeness): #39535, #19571, #15880, #20387

amahendrakar removed the stat:awaiting response label on Apr 19, 2021
amahendrakar assigned ymodak and unassigned amahendrakar on Apr 19, 2021
ymodak added the stat:awaiting tensorflower label on Apr 19, 2021
ymodak assigned sanjoy and unassigned ymodak on Apr 19, 2021
@sanjoy
sanjoy commented Apr 20, 2021

Hi,

We expect this to be a non-issue once we're using the CUDA malloc async allocator by default. Can you give it a try? You can enable it by adding TF_GPU_ALLOCATOR=cuda_malloc_async to the environment.
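For anyone following along, a minimal sketch of enabling this from Python; the variable has to be set before TensorFlow initializes the GPU, so setting it before the import is the safe option (assuming a GPU build of TF):

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"   # must be set before importing TF

import tensorflow as tf
print(tf.zeros([2, 2]))   # first GPU op; the device log should mention the CUDA malloc Async allocator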

@GatGit12
GatGit12 commented Apr 20, 2021

Hi,

could you please clarify with which TF version and CUDA version I can/should try this option?

A quick search in this repo revealed that at least CUDA 11.2 is required:

// It needs CUDA 11.2+. When using a container, this only needs the

But TF 2.4.0/2.4.1 is only built for CUDA 11.0 (https://www.tensorflow.org/install/source_windows#gpu).
So I tried it with the current TensorFlow 2.5.0-rc1 which supports CUDA 11.2.

TensorFlow pip packages are now built with CUDA11.2 and cuDNN 8.1.0

However, the error tensorflow.python.framework.errors_impl.InternalError: No allocator statistics occurs even for a simple operation.
Complete output:

>>> import tensorflow as tf
2021-04-20 15:39:08.410710: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2
>>> tf.__version__
'2.5.0-rc1'
>>> a = tf.constant(1.0)
2021-04-20 15:39:24.562570: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-20 15:39:24.625868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 48.00GiB deviceMemoryBandwidth: 625.94GiB/s
2021-04-20 15:39:24.633782: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-20 15:39:24.646338: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-04-20 15:39:24.649997: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-04-20 15:39:24.657917: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-04-20 15:39:24.663828: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-04-20 15:39:24.676594: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_11.dll
2021-04-20 15:39:24.683808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-04-20 15:39:24.688977: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-04-20 15:39:24.693099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-04-20 15:39:24.696781: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-20 15:39:24.710986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 48.00GiB deviceMemoryBandwidth: 625.94GiB/s
2021-04-20 15:39:24.719253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-04-20 15:39:25.394786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-20 15:39:25.399375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-04-20 15:39:25.401865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-04-20 15:39:25.404486: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:210] Using CUDA malloc Async allocator for GPU.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\gat\.conda\envs\TF2.5_TEST\lib\site-packages\tensorflow\python\framework\constant_op.py", line 264, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "C:\Users\gat\.conda\envs\TF2.5_TEST\lib\site-packages\tensorflow\python\framework\constant_op.py", line 276, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "C:\Users\gat\.conda\envs\TF2.5_TEST\lib\site-packages\tensorflow\python\framework\constant_op.py", line 301, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "C:\Users\gat\.conda\envs\TF2.5_TEST\lib\site-packages\tensorflow\python\framework\constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "C:\Users\gat\.conda\envs\TF2.5_TEST\lib\site-packages\tensorflow\python\eager\context.py", line 525, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: No allocator statistics

Some general questions regarding TF_GPU_ALLOCATOR=cuda_malloc_async
Could you also explain how the new allocator works? For example: when will the GPU RAM be released when using the new allocator option? Must this be done manually, and/or can it be done manually? Will it also work in the C API? With which TF version will it be introduced?

Thank you!

@ruler501
ruler501 commented May 1, 2021

I'm seeing the same issue with cuda_malloc_async. I'm on Linux with kernel 5.11.16, Nvidia driver 465.27-2, Cuda 11.2.2-2, Cudnn 8.2.0.53-1, and python 3.8.8. Tensorflow is tf-nightly-2.6.0.dev20210501.

@sanjoy
sanjoy commented May 20, 2021

I believe this is fixed by #49173 (CC @nouiz )

@nouiz
nouiz commented May 20, 2021

@sanjoy @GatGit12

PR #49173, which is approved but not yet merged, should fix the tensorflow.python.framework.errors_impl.InternalError: No allocator statistics error.

Once it is merged, you can probably use the TF nightly build the next day to test it again.

Note, this new allocator can solve the issue, but it requires that every other library in use also uses cudaMallocAsync (EDIT: or cudaMalloc).
Do you know how memory allocation is handled by your other libraries?

@nouiz
nouiz commented May 20, 2021

I must correct myself. In fact, it should work if the other lib uses cudaMallocAsync or cudaMalloc. But it won't if the other lib keeps its unused memory instead of freeing it, like the PyTorch pool allocator or the TF GPU suballocator, unless that unused memory is "freed".

Memory in cudaMallocAsync pool can also be released implicitly by the CUDA driver in order to allow an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc() or cuMemCreate() could cause CUDA to free unused memory from any memory pool associated with the device in the same process in order to serve the request.

@hanshengchiu

Is cuda_malloc_async supported in TF 2.5.0? I got InternalError: No allocator statistics with it.

@nouiz
nouiz commented Jul 15, 2021

It was buggy in TF 2.5. It should work in TF 2.6. You can try TF 2.6.0-rc1, which was released recently:
https://github.com/tensorflow/tensorflow/releases

@nouiz
nouiz commented Jul 15, 2021

Note, this issue should be closed now that the fix is merged. I do not have the rights to close it.

@sushreebarsa

@GatGit12 could you please let us know if this issue is fixed for you as per this comment? Thanks!

sushreebarsa added the stat:awaiting response label on Jul 16, 2021
@hanshengchiu

It was buggy in TF 2.5. It should work in TF 2.6. You can try TF 2.6.0-rc1, which was released recently:
https://github.com/tensorflow/tensorflow/releases

Tested with 2.6.0-rc1 and got the following error with TF_GPU_ALLOCATOR=cuda_malloc_async:

I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:215] Using CUDA malloc Async allocator for GPU: 0
Process finished with exit code -1073740940 (0xC0000374)

Same error with TF_GPU_ALLOCATOR=cuda_malloc -
tensorflow/core/common_runtime/gpu/gpu_process_state.cc:205] Using CUDA malloc allocator for GPU.
Process finished with exit code -1073740940 (0xC0000374)

@nouiz
nouiz commented Jul 16, 2021

Is it easy for you to share a reproduction?

@nouiz
nouiz commented Jul 16, 2021

Also, which driver version do you use and which OS?

@hanshengchiu

CUDA 11.2.0 and the stock driver that came with the CUDA SDK.
OS: Windows 10 build 19042.1110.

@hanshengchiu

I didn't test this on Windows as I do not have a Windows computer set up.
Can you try this small script? It enables extra logging, so maybe it will help us understand where it crashes:

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
os.environ["TF_CPP_VMODULE"]="gpu_process_state=10,gpu_cudamallocasync_allocator=10"
import tensorflow as tf
a = tf.zeros([], tf.float32)

Same outcome. But this time I ran with CUDA 11.2.2:

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
>>> import os
>>> os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
>>> os.environ["TF_CPP_VMODULE"]="gpu_process_state=10,gpu_cudamallocasync_allocator=10"
>>> import tensorflow as tf
>>> a = tf.zeros([], tf.float32)
2021-07-16 23:36:54.937204: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-16 23:36:55.475806: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:215] Using CUDA malloc Async allocator for GPU: 0
Process finished with exit code -1073740940 (0xC0000374)

@nouiz
nouiz commented Jul 16, 2021

Can you try with CUDA 11.3 or CUDA 11.4? I do not have the setup to test on Windows, and I do not know when someone can check it.
Also, how do you use CUDA on Windows? There are a few different ways to do this.

@hanshengchiu

Can you try with CUDA 11.3 or CUDA 11.4? I do not have the setup to test on Windows, and I do not know when someone can check it.
Also, how do you use CUDA on Windows? There are a few different ways to do this.

11.4 gave the same error, even with cuda_malloc.
On the other hand, I was able to use cuda_malloc with TF2.5.0.

@ruler501
ruler501 commented Jul 17, 2021

I'm getting the same with 11.2 and 11.4 with the latest drivers on Windows. 2.6.0-rc1 crashes, with the debug dump reporting Python heap corruption with either cuda_malloc or cuda_malloc_async. I can provide the dump file if that would help.

CUDA was installed with the default Windows 10 installer from NVIDIA, and the PATH environment variable was updated to include the relevant directories. cuDNN was installed by copying the files from the NVIDIA zip file to the appropriate locations.

I am running the developer preview, which is Windows 11, so it's possible that's the problem.

@hanshengchiu

I'm getting the same with 11.2 and 11.4 with the latest drivers on Windows. 2.6.0-rc1 crashes, with the debug dump reporting Python heap corruption with either cuda_malloc or cuda_malloc_async. I can provide the dump file if that would help.

CUDA was installed with the default Windows 10 installer from NVIDIA, and the PATH environment variable was updated to include the relevant directories. cuDNN was installed by copying the files from the NVIDIA zip file to the appropriate locations.

I am running the developer preview, which is Windows 11, so it's possible that's the problem.

Not likely a Windows 11 issue. I'm running Windows 10 without the Insider Preview.

@hanshengchiu

Can you try with CUDA 11.3 or CUDA 11.4? I do not have the setup to test on Windows, and I do not know when someone can check it.
Also, how do you use CUDA on Windows? There are a few different ways to do this.

Not using a container. I just installed the CUDA SDK (but without NVTX and PhysX), cuDNN, and the stock driver that came with the SDK.

@nouiz
nouiz commented Jul 21, 2021

Do you have 1 or multiple GPUs? Which GPU(s) do you have?

@ruler501

I have a 2080 Ti as the only connected GPU and an AMD 3950X processor.

@hanshengchiu

I have a single GTX 1060 and an Intel i7-8750H.

@gilfree
gilfree commented Jul 27, 2021

I also fail to run the mentioned code.

Machine setup: CUDA 11.4, TensorFlow nightly build tf_nightly-2.7.0.dev20210727, Python 3.7.5, 4x RTX 2080 Ti, NVIDIA driver 470.57.02.

Running this code:

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
os.environ["TF_CPP_VMODULE"]="gpu_process_state=10,gpu_cudamallocasync_allocator=10"
import tensorflow as tf
a = tf.zeros([], tf.float32)

Segfaults:

2021-07-27 13:13:25.349562: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:215] Using CUDA malloc Async allocator for GPU: 0
Segmentation fault (core dumped)

The callstack top is:

(gdb) bt
#0  0x000000000000002c in ?? ()
#1  0x00007ffdab0f8760 in tensorflow::GPUProcessState::GetGPUAllocator(tensorflow::GPUOptions const&, tensorflow::gtl::IntType<tensorflow::TfDeviceId_tag_, int>, unsigned long, std::vector<tensorflow::gtl::IntType<tensorflow::TfDeviceId_tag_, int>, std::allocator<tensorflow::gtl::IntType<tensorflow::TfDeviceId_tag_, int> > > const&) ()
   from REDACTED/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.2

The same happens for the cuda_malloc allocator. When no allocator is set, everything works.

When memory_guard is set, I get:

2021-07-27 13:32:07.880641: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:200] Using memory guard allocator for GPU.
2021-07-27 13:32:07.880692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1504] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9672 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:60:00.0, compute capability: 7.5
2021-07-27 13:32:08.077423: F tensorflow/core/framework/tensor.cc:682] Check failed: IsAligned() Aligned and single element
Aborted (core dumped)

The machine is a multi-GPU machine, but I set CUDA_VISIBLE_DEVICES=0.

I will gladly provide any other information needed. I would really like this feature to land.

@nouiz
nouiz commented Jul 29, 2021

Quick update. It was also crashing on Linux. I made this PR to fix the Linux crash:
#50961

Do you know if there is a nightly build for Windows that you could use to test it?

@nouiz
nouiz commented Aug 30, 2021

The fix for cudaMallocAsync was merged a few hours ago.
Can you wait 24h and try the TF nightly build to make sure it also works for you?
If you have any comments on this new feature, please share them with us.

To enable it, use the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async.

Please share your results with this new feature.

Note, with this new allocator, the CUDA driver will automatically release memory that is reserved but not used whenever another library in the same process makes a cudaMalloc call that would otherwise run out of memory. So you do not need to trigger a trim command yourself.
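As a rough way to observe this, one could compare allocator statistics before and after dropping a tensor. This is only a sketch, assuming tf.config.experimental.get_memory_info is available (TF 2.5+) and a build that already contains the allocator-statistics fix for cuda_malloc_async:

import os
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
import tensorflow as tf

x = tf.random.normal([4096, 4096])                       # roughly 64 MiB of float32 on the GPU
print(tf.config.experimental.get_memory_info("GPU:0"))   # 'current' includes the tensor
del x
print(tf.config.experimental.get_memory_info("GPU:0"))   # 'current' should drop back down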

@meslane
meslane commented Sep 6, 2021

The fix for cudaMallocAsync was merged a few hours ago.
Can you wait 24h and try the TF nightly build to make sure it also works for you?
If you have any comments on this new feature, please share them with us.

To enable it, use the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async.

Please share your results with this new feature.

Note, with this new allocator, the CUDA driver will automatically release memory that is reserved but not used whenever another library in the same process makes a cudaMalloc call that would otherwise run out of memory. So you do not need to trigger a trim command yourself.

Got the same error as the other replies in this thread, and reinstalling 11.4 changed nothing. After downloading the nightly build it started working again. Good work.

@nouiz
nouiz commented Sep 7, 2021

@meslane thanks for the confirmation. Can you tell me which OS you are using?

@meslane
meslane commented Sep 7, 2021

@nouiz I am on Windows 10. GPU is an RTX2060 and CPU is an i7-4790k.

@gilfree
gilfree commented Sep 9, 2021

I also fail to run the mentioned code.

Machine setup: CUDA 11.4, TensorFlow nightly build tf_nightly-2.7.0.dev20210727, Python 3.7.5, 4x RTX 2080 Ti, NVIDIA driver 470.57.02.

Running this code:

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
os.environ["TF_CPP_VMODULE"]="gpu_process_state=10,gpu_cudamallocasync_allocator=10"
import tensorflow as tf
a = tf.zeros([], tf.float32)

Works on nightly, on the same system.

Thanks!

@nouiz
nouiz commented Sep 9, 2021

@gilfree Thanks for the confirmation.
@GatGit12 if you have time to confirm that this resolves your original issue, that would be great.

@Vadim-Maklakov

Hi all!
Continuing this topic.

On my system configuration:
TensorFlow version: 2.7.0-dev20210917 from pip
Python version: 3.8.12, built from source with clang-12
Local host OS: Linux Debian 10, Linux-4.19.0-17-amd64-x86_64
IPython version: 7.27.0
CPU: AMD FX-8350
GPU: GeForce GT 1050 2GB (sm_61)
Driver version: 470.57.02, CUDA version: 11.4, cuDNN 8.2.2
this magic code

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
os.environ["TF_CPP_VMODULE"]="gpu_process_state=10,gpu_cudamallocasync_allocator=10"
import tensorflow as tf
a = tf.zeros([], tf.float32)

did not turn out to be a silver bullet.

When I ran this simple code, which creates, deletes, and re-creates a simple model, in a Jupyter Notebook

from tensorflow import keras

# (x_train, y_train) and (x_test, y_test) are assumed to be loaded in earlier notebook cells.

# Create model
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')])

# Compile model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

# Train and validation datasets
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Train and validate model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

# Delete the model above and recreate it with a different number of epochs

# Delete model above
del model

# Recreate new model for final evaluation
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')])

model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['acc'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test, return_dict=True)
mp_test = model.predict(x_test)

I get this result with TensorFlow 2.7.0-dev (shortened):

2021-09-17 21:07:21.444997: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 1000000000 exceeds 10% of free system memory.
2021-09-17 21:07:32.198909: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 953.67MiB (rounded to 1000000000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
.......................
2021-09-17 21:07:32.202662: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 1467809792 memory_limit_: 1467809792 available bytes: 0 curr_region_allocation_bytes_: 2935619584
2021-09-17 21:07:32.202700: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      1467809792
InUse:                       601411072
MaxInUse:                   1024997120
NumAllocs:                       47638
MaxAllocSize:                600000000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2021-09-17 21:07:32.202734: W tensorflow/core/common_runtime/bfc_allocator.cc:468]
......................
InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

This means that explicitly deleting the model object did not flush the GPU memory in TF 2.7.0-dev, just as in TF 2.6.0.

For some reason, the same code above in 2.5.1 (without the allocator settings) showed warnings in the last lines, but it executed without problems or errors.

In my case the miracle did not happen and the elephant did not fly. I gave up on these daring, time-consuming experiments and downgraded to 2.5.1.

If the output listings of this simple code for TF 2.5.1 and 2.7.0-dev are of interest, see the attached Jupyter Notebooks.

Best Regards,
Vadim Maklakov.

TF251-270-dev_CUDA_11.4.ipynb.tar.gz

@nouiz
nouiz commented Sep 17, 2021

Thanks for this long description.
I suppose that you tried cuda_malloc_async due to OOM with TF 2.7.
cuda_malloc_async won't help if the model isn't deleted.

Note, in your example you use partial_x_train and partial_y_train the first time, but the second time you use x_train and y_train, which are bigger. So it is normal that the second run requests more memory, and maybe the OOM is a real OOM.

@Vadim-Maklakov
Vadim-Maklakov commented Sep 18, 2021

I suppose that you tried cuda_malloc_async due to OOM with TF 2.7.

The first cell in my TF 2.7.0-dev Jupyter Notebook was the "silver bullet" code:

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
os.environ["TF_CPP_VMODULE"]="gpu_process_state=10,gpu_cudamallocasync_allocator=10"
import tensorflow as tf
a = tf.zeros([], tf.float32)

but the elephant does not take off...
In TF 2.5.1 I can even skip deleting this simple model; I only get a warning message.

@nouiz
nouiz commented Sep 19, 2021

Your issue is different from this one, and I do not think cuda_malloc_async can help you. You should open another issue for your problem.

@Vadim-Maklakov

Formally, the GT 1050 (sm_61) supports the CUDA 11.4 toolkit, hence the "silver bullet" code should work correctly with TF 2.7.0-dev and clear GPU memory when the old object is deleted.

@nouiz
nouiz commented Sep 20, 2021

There can be many different causes of OOM errors, so there can't be a silver-bullet solution.
The problem you hit is that TF doesn't call free on some GPU memory.

This issue is a different one. It is about how the TF memory allocator works, and cuda_malloc_async adds the feature requested in this issue.
Your problem is something totally different from how the allocator itself behaves; it is about how TF behaves.
Reusing an issue not related to your problem won't get it the attention it needs. For that, you must find an existing issue that is the same as yours or create a new one.
Just create a new issue so that the right people have a chance of hearing about it. I'm not the right person.

@alexcoca
alexcoca commented Jun 29, 2022

@nouiz I was wondering why you consider @Vadim-Maklakov's problem to be different. As far as I understand, what he wants to do is clear the GPU memory after a training loop in order to train the model again. So, assuming the device is capable of training on the entire dataset, one would expect to have a mechanism to clear the GPU memory in order to train the same model multiple times (which is why it is important to have the ability to "clear" GPU memory). Is this not what the current feature supports, aka clearing the GPU memory? If I were to do this, I would just sequentially launch independent processes from the CLI to avoid pain.

However, I am doing research where I sadly developed an optimisation algorithm requiring two models, one in PyTorch and one in TensorFlow. They are implemented in different frameworks in the transformers library, so there is nothing I can do to keep everything in the same framework. My process is simply described as:

  • use a PyTorch language model to generate a large number of sentence pairs
  • measure the similarity of those sentence pairs using a learned metric developed by Google researchers; the transformers library only offers a TensorFlow implementation

The above process happens 3 times, so I need to clear the GPU memory after using the TF model so I can reload the language model. When I use PyTorch models only, I can release the memory very easily and everything works. My expectation is that the new feature discussed in this thread should allow me to implement the above, as long as I use delattr on the attribute in my class that stores the reference to the TF model. Am I delusional?
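For what it's worth, a minimal sketch of the process-isolation workaround mentioned above: run the TF step in its own short-lived process so all of its GPU memory is returned to the driver when the process exits. Names like score_pairs are hypothetical placeholders, not transformers APIs:

import multiprocessing as mp

def _tf_score_worker(sentence_pairs, out_queue):
    # Import TF only inside the child so the parent process never owns its GPU memory.
    import tensorflow as tf  # noqa: F401
    # ... load the TF metric model and score the pairs here (omitted) ...
    scores = [0.0] * len(sentence_pairs)   # placeholder result
    out_queue.put(scores)

def score_pairs(sentence_pairs):
    ctx = mp.get_context("spawn")          # spawn avoids inheriting CUDA state
    queue = ctx.Queue()
    proc = ctx.Process(target=_tf_score_worker, args=(sentence_pairs, queue))
    proc.start()
    scores = queue.get()
    proc.join()                            # all GPU memory held by the child is freed here
    return scores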

@alexcoca
alexcoca commented Jun 29, 2022

@nouiz according to #36465 I am in for bitter disappointment, as this fix does not seem to solve a fundamental issue: deallocation of memory allocated to TF models once there are no more references to them. Is this feature impossible to implement for TensorFlow, @nouiz?

@alexcoca

Actually, in my case using cuda_malloc_async seems to have worked; the code no longer breaks when attempting to load the language model the second time.

@Expendable0
Expendable0 commented Jan 11, 2024

Was this problem never solved? I have a process that trains several very small models in succession (much smaller than what my GPU is capable of). After the 16th or so model, I get a memory issue every time.

I am using TensorFlow 2.15 with CUDA 12.2:
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

And I run this after every model is trained (and saved to disk):

del model
gc.collect()
tf.keras.backend.clear_session()
