Introduce ability to clear GPU memory in Tensorflow 2 #48545
But this issue (#36465) is marked as a bug, with no attention and no explicit feature request, which is why I formally opened this post here as a feature request. There are also several issues about GPU RAM clearing that are simply ignored. Here is a selection (without guarantee of completeness): #39535, #19571, #15880, #20387
Hi, we expect this to be a non-issue once we're using the CUDA malloc async allocator by default. Can you give it a try? You can enable it by setting the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async.
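For reference, a minimal sketch of enabling this from Python. The variable is read when TensorFlow initializes the GPU device, so it has to be set before that happens (safest: before the import); this assumes a TF build new enough to ship the allocator.

```python
import os

# TF_GPU_ALLOCATOR is read at GPU-device initialization time, so set it
# before importing TensorFlow (or at least before the first GPU op runs).
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# import tensorflow as tf  # import only after the variable is set
```

Setting it inside an already-running session that has touched the GPU has no effect, which is a common source of confusion.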
Hi, could you please clarify with which TF version and CUDA version I can/should try this option? A quick search in this repo revealed that at least CUDA 11.2 is required:
But TF 2.4.0/2.4.1 is only built for CUDA 11.0 (https://www.tensorflow.org/install/source_windows#gpu). So I tried it with the current TensorFlow 2.5.0-rc1, which supports CUDA 11.2.
However, the error still occurs. Some general questions regarding this option remain. Thank you!
I'm seeing the same issue.
PR #49173 is approved but not yet merged and should fix this. Once it is merged, you can probably use the next day's TF nightly build to test it again. Note, this new allocator can solve the issue, but it requires that all libraries in use allocate via cudaMallocAsync (EDIT: or cudaMalloc).
I must correct myself. In fact, it should work if the other library uses cudaMallocAsync or cudaMalloc. But it won't if the other library keeps its unused memory instead of freeing it, as the PyTorch pool allocator or the TF GPU suballocator do, unless that unused memory is "freed". Memory in a cudaMallocAsync pool can also be released implicitly by the CUDA driver in order to allow an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc() or cuMemCreate() could cause CUDA to free unused memory from any memory pool associated with the device in the same process in order to serve the request.
Is cuda_malloc_async supported in TF 2.5.0? I got InternalError: No allocator statistics with it. |
It was bugged in TF 2.5; it should work in TF 2.6. You can try TF 2.6.0-rc1, which was released recently:
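For reference, a guarded sketch of the statistics query that raised the InternalError above. The helper name and its None-fallbacks are this sketch's own; the calls are the public TF API (`tf.config.experimental.get_memory_info` returns a dict with `current` and `peak` byte counts on TF 2.5+).

```python
def gpu_memory_info():
    """Return TF's allocator statistics for GPU:0, or None if unavailable."""
    try:
        import tensorflow as tf  # treated as an optional dependency here
    except ImportError:
        return None
    if not tf.config.list_physical_devices("GPU"):
        return None
    # On the buggy TF 2.5 + cuda_malloc_async combination this raised
    # InternalError ("No allocator statistics"); it works on TF 2.6+.
    return tf.config.experimental.get_memory_info("GPU:0")
```

This makes it easy to check whether the allocator you enabled actually reports statistics after the fix.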
Note, this issue should be closed now that the fix is merged. I do not have the rights to close it.
@GatGit12 could you please let us know if this issue is fixed for you as per this comment, thanks!
Tested with 2.6.0-rc1 and got the following error with TF_GPU_ALLOCATOR=cuda_malloc_async:
Same error with TF_GPU_ALLOCATOR=cuda_malloc.
Is it easy for you to share a reproduction? |
Also, which driver version do you use and which OS? |
CUDA 11.2.0 and the stock driver that came with the CUDA SDK.
Same outcome. But this time I ran with CUDA 11.2.2:
Can you try with CUDA 11.3 or CUDA 11.4? I do not have the setup to test on Windows, and I do not know when someone can check it.
11.4 gave the same error, even with cuda_malloc. |
I'm getting the same with 11.2 and 11.4 with the latest drivers on Windows. 2.6.0-rc1 crashes, with the debug dump reporting Python heap corruption, with either cuda_malloc or cuda_malloc_async. I can provide the dump file if that would help. CUDA was installed with the default Windows 10 installer from Nvidia, and the PATH environment variable was updated to the relevant directories. cuDNN was installed by copying the files from the Nvidia zip file to the appropriate locations. I am running the developer preview, which is Windows 11, so it's possible that's the problem.
This is not likely a Windows 11 issue; I'm running Windows 10 without the Insider preview.
Not using a container. I just installed the CUDA SDK (but without NVTX and PhysX), cuDNN, and the stock driver that came with the SDK.
Do you have 1 or multiple GPUs? Which GPU(s) do you have? |
I have a 2080 Ti as the only connected GPU and an AMD 3950X processor.
I have a single GTX 1060 and an Intel i7-8750H.
I also fail to run the mentioned code. Machine setup: CUDA 11.4, TensorFlow nightly build tf_nightly-2.7.0.dev20210727, Python 3.7.5, 4x RTX 2080 Ti, Nvidia driver 470.57.02. Running this code:
Segfaults: 2021-07-27 13:13:25.349562: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:215] Using CUDA malloc Async allocator for GPU: 0
Segmentation fault (core dumped) The callstack top is: (gdb) bt
#0 0x000000000000002c in ?? ()
#1 0x00007ffdab0f8760 in tensorflow::GPUProcessState::GetGPUAllocator(tensorflow::GPUOptions const&, tensorflow::gtl::IntType<tensorflow::TfDeviceId_tag_, int>, unsigned long, std::vector<tensorflow::gtl::IntType<tensorflow::TfDeviceId_tag_, int>, std::allocator<tensorflow::gtl::IntType<tensorflow::TfDeviceId_tag_, int> > > const&) ()
from REDACTED/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.2
The same happens for the cuda_malloc allocator. When no allocator is set, everything works. When memory_guard is set, I get:
The machine is a multi-GPU machine. I will gladly provide any other information needed; I would really like this feature to land!
Quick update: it was also crashing on Linux. I made this PR to fix the Linux crashes: Do you know if there is a nightly build for Windows for you to test it?
The fix for cudaMallocAsync was merged a few hours ago. To enable it, use the environment variable TF_GPU_ALLOCATOR=cuda_malloc_async. Please share your results with this new feature. Note, with this new allocator, the CUDA driver will automatically release reserved but unused memory when another library in the same process makes a cudaMalloc call that would otherwise fail for lack of memory. So you do not need to trigger a trim command yourself.
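For a one-off test of a nightly build, the variable can also be set in the shell instead of in Python; the training script name below is a placeholder, not something from this thread.

```shell
# Enable the async allocator for subsequent TF runs in this shell session.
export TF_GPU_ALLOCATOR=cuda_malloc_async

# Then launch training as usual, e.g.:
# python your_training_script.py
echo "TF_GPU_ALLOCATOR=$TF_GPU_ALLOCATOR"
```

If the allocator is active, the TF log at startup should contain a line like the "Using CUDA malloc Async allocator for GPU: 0" message quoted earlier in this thread.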
Got the same error as the other replies in this thread, and reinstalling 11.4 changed nothing. After downloading the nightly build it started working again. Good work. |
@meslane thanks for the confirmation. Can you tell me which OS you are using? |
@nouiz I am on Windows 10. GPU is an RTX2060 and CPU is an i7-4790k. |
Works on nightly, on the same system. Thanks! |
Hi, all! In my system configuration:
When I tried to run this simple code, which created, deleted and re-created a simple model in a Jupyter Notebook,
I get this result for TensorFlow 2.7.0-dev (shortened).
This means that the direct command to delete the model object didn't flush the GPU memory in TF 2.7.0-dev, just as in TF 2.6.0. For some reason, the same code in 2.5.1 above (without allocators) showed warnings in the last lines, but it executed without problems or errors. In my case the miracle did not happen and the elephant did not fly. I gave up on further creative experiments, which consumed a lot of time, and downgraded to 2.5.1. If the output listings of this simple code for TF 2.5.1 and 2.7.0-dev are of interest, see the attachment as Jupyter Notebooks. Best Regards,
Thanks for this long description. Note that in your example you use partial_x_train and partial_y_train the first time, but x_train and y_train, which are bigger, the second time. So it is normal that the second run requests more memory, and maybe the OOM is a real OOM.
The first cell in my Jupyter Notebook for TF 2.7.0-dev was supposed to be a silver bullet:
but the elephant still doesn't take off...
Your issue is different from this one, and I do not think cuda_malloc_async can help you. You should open another issue for your problem.
Formally, the GT1050 (sm_61) supports the CUDA 11.4 toolkit - hence
There can be many different causes of OOM errors, so there can't be a silver-bullet solution. This issue is a different one: it is about how the TF memory allocator works. cuda_malloc_async adds the feature requested in this issue.
@nouiz I was wondering why you consider @Vadim-Maklakov's problem to be different. As far as I understand, what he wants is to clear the GPU memory after a training loop in order to train the model again. So, assuming the device is capable of training on the entire dataset, one would expect a mechanism to clear the GPU memory so the same model can be trained multiple times (which is why the ability to "clear" GPU memory is important). Is this not what the current feature supports, aka clearing the GPU memory? If I were to do this, I would just sequentially launch independent processes from the CLI to avoid pain. However, I am doing research where I sadly developed an optimisation algorithm requiring two models, one in PyTorch and one in TensorFlow. These are implemented in different frameworks.
The above process happens 3 times, so I need to clear the GPU memory after each phase.
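In this mixed PyTorch/TensorFlow setting, a sketch of the PyTorch side of the hand-off: `torch.cuda.empty_cache()` is PyTorch's documented call for returning cached, unused blocks to the driver, after which another library in the same process (e.g. TF with cuda_malloc_async) can allocate them. The helper name and the guarded import are this sketch's own.

```python
import gc

def release_torch_cache():
    """Hand PyTorch's cached-but-unused GPU blocks back to the driver."""
    gc.collect()  # drop Python-side references to dead tensors first
    try:
        import torch  # treated as optional in this sketch
    except ImportError:
        return False  # PyTorch not installed; nothing to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # returns cached, unused blocks to the driver
        return True
    return False
```

Note this only releases memory PyTorch is *caching*, not memory held by live tensors; those still need to be deleted (or go out of scope) first.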
Actually, in my case using |
Was this problem never solved? I have a process that trains several very small models in succession (much smaller than what my GPU is capable of). After the 16th or so model, I get a memory issue every time. I am using TensorFlow 2.15 with CUDA 12.2. And I run this after every model is trained (and saved to disk):
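The snippet itself did not survive in this thread; a typical cleanup sequence for this situation (a sketch of the common pattern, not necessarily the poster's code) looks like:

```python
import gc

def release_model(model):
    """Drop a trained Keras model and TF's global Python-side state.

    Caveat: with the default BFC allocator the freed device memory stays
    in TF's pool for reuse by the same process; it is not returned to the
    driver, which is exactly the limitation this issue is about.
    """
    del model
    try:
        import tensorflow as tf  # treated as optional in this sketch
        tf.keras.backend.clear_session()  # resets Keras's global state
    except ImportError:
        pass
    gc.collect()  # collect any remaining reference cycles
```

If even small models accumulate until an OOM after many iterations, the pool is likely fragmenting; TF_GPU_ALLOCATOR=cuda_malloc_async or per-model subprocesses are the usual ways around it.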
Please make sure that this is a feature request. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
Describe the feature and the current behavior/state.
Currently there is no way to completely free the (once) allocated GPU RAM.
For example, I want to use TensorFlow in the context of 3D visualization, which is made next to impossible by this behavior. Standard solutions like
tf.config.experimental.set_memory_growth(gpus[0], True)
are unfortunately not sufficient, because the once-allocated RAM cannot be released again. In #36465 (#36465 (comment)), it is mentioned that by using
GPUProcessState::TestOnlyReset
and ProcessState::TestOnlyReset
the option to release GPU memory exists, but it is just not exposed, or intended for testing purposes only. It would be very nice for applications using TensorFlow to have proper access to GPU RAM release functions.
Will this change the current api? How?
Introduce a new (experimental) function to reset the current session/graph/device/... - state and thus free the GPU RAM completely.
Who will benefit with this feature?
People who use Tensorflow in their application in conjunction with other GPU-RAM critical operations such as 3D rendering.
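Until such a function is exposed, the workaround commonly suggested for this limitation is process isolation: run each TF workload in a short-lived child process, since the driver reclaims all of a process's GPU memory when it exits. A sketch (the helper names and the trivial job function are illustrative, not part of any TF API):

```python
import multiprocessing as mp

def run_isolated(fn, *args, start_method="spawn"):
    """Run fn(*args) in a child process; its GPU memory is freed on exit.

    "spawn" starts a fresh interpreter, avoiding inheritance of any CUDA
    state from the parent process.
    """
    ctx = mp.get_context(start_method)
    p = ctx.Process(target=fn, args=args)
    p.start()
    p.join()
    return p.exitcode

def train_job(n):
    # Placeholder body: import tensorflow here and train the n-th model.
    # Importing TF inside the child keeps the parent process GPU-free.
    print(f"trained model {n}")
```

The downside is the per-process startup cost (CUDA context creation, TF import), which is why a proper in-process release function, as requested here, would still be valuable.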