
How can I clear GPU memory in tensorflow 2? #36465

Open
HristoBuyukliev opened this issue Feb 4, 2020 · 120 comments
Labels
comp:gpu GPU related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.7 Issues related to TF 2.7.0 type:bug Bug

Comments

@HristoBuyukliev

System information

  • Custom code; nothing exotic though
  • OS: Ubuntu 18.04
  • TensorFlow installed from source (with pip)
  • TensorFlow version: v2.1.0-rc2-17-ge5bf8de
  • Python version: 3.6
  • CUDA version: 10.1
  • GPU: Tesla V100, 32 GB memory

I created a model, nothing especially fancy in it. When I create the model, nvidia-smi shows that TensorFlow takes up nearly all of the GPU memory. When I fit the model with a small batch size, it runs successfully. When I fit with a larger batch size, it runs out of memory. Nothing unexpected so far.

However, the only way I can then release the GPU memory is to restart my computer. When I run nvidia-smi I can see the memory is still used, but there is no process using the GPU. Also, if I try to run another model, it fails much sooner.

Nothing in the first five pages of Google results works (and most solutions are for TF1).

Is there any way to release GPU memory in tensorflow 2?

@amahendrakar
Contributor

@HristoBuyukliev,
Could you please check this TensorFlow documentation and let us know if it helps? Thanks!

@amahendrakar amahendrakar added comp:gpu GPU related issues TF 2.1 for tracking issues in 2.1 release type:support Support issues stat:awaiting response Status - Awaiting response from author labels Feb 5, 2020
@HristoBuyukliev
Author

@amahendrakar Hi, this is not what I am looking for. Not using up all the memory at once sounds like a useful feature; however, I am looking to clear the memory TF has already taken.

I just tried it out, and it doesn't help. I am iteratively increasing the batch size, trying to find the biggest one I can use. Once the Jupyter kernel crashes, the memory stays taken up.

Additionally, even the advertised functionality does not work: I made a model with half as many parameters, and TensorFlow still took up 31 out of 32 gigabytes.

@HristoBuyukliev HristoBuyukliev changed the title How can I clear GPU memory in tensorflow 2.0? How can I clear GPU memory in tensorflow 2? Feb 5, 2020
@amahendrakar amahendrakar assigned ymodak and unassigned amahendrakar Feb 5, 2020
@taborda11

Hello @HristoBuyukliev, I had a similar problem when I was iterating over model.predict(). If you are iteratively increasing the batch size, try calling tf.keras.backend.clear_session() after training with each batch size.
That seems to be a case of a memory leak in each training run.
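
For concreteness, a minimal sketch of what I mean (the toy model and the batch sizes here are only placeholders):

import tensorflow as tf

for batch_size in (32, 64, 128):
    dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(batch_size)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(loss='mse', optimizer='adam')
    model.fit(dataset, epochs=1, verbose=0)
    del model
    tf.keras.backend.clear_session()  # drop Keras' global state between runs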

@ymodak
Contributor
ymodak commented Feb 5, 2020

You may try enabling GPU memory growth in this case.
Put the following snippet at the top of your code:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Allocate GPU memory on demand instead of reserving almost all of it up front
    tf.config.experimental.set_memory_growth(gpus[0], True)
# your code
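
A related option from the same TF GPU guide, in case capping the allocation is enough for your use case (a sketch; the 4 GB limit is an arbitrary number of mine):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Pin TensorFlow to a fixed 4 GB slice of the first GPU instead of the whole card
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
# your code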

@EKami
EKami commented Feb 6, 2020

Hi @HristoBuyukliev, this is a very old issue that everyone is facing in TF 1.x as well as TF 2.x. It seems to be a design flaw, and the TF team doesn't seem to care about fixing it (I have been facing this issue for more than 2 years now).

What worked well for me was just to run my train/eval in a separate process and wait for it to finish. When the process finishes, the system kills it and releases the GPU resources automatically.
You can achieve this by doing something like:

import multiprocessing

process_eval = multiprocessing.Process(target=evaluate, args=(...))
process_eval.start()
process_eval.join()
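
To make this concrete end to end, here is a self-contained sketch (the train_once function and the toy model are mine, just for illustration). The key points are importing TF inside the child and using the 'spawn' start method, so the parent process never initializes CUDA and all GPU memory is released when the child exits:

import multiprocessing as mp

def train_once(batch_size, queue):
    # Import TF inside the child so only this short-lived process touches the GPU
    import tensorflow as tf
    dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(batch_size)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(loss='mse', optimizer='adam')
    history = model.fit(dataset, epochs=1, verbose=0)
    queue.put(history.history['loss'][-1])

if __name__ == '__main__':
    ctx = mp.get_context('spawn')   # 'spawn' keeps the parent free of CUDA state
    queue = ctx.Queue()
    worker = ctx.Process(target=train_once, args=(10, queue))
    worker.start()
    final_loss = queue.get()        # read the result before joining
    worker.join()                   # the GPU memory goes away with the process
    print(final_loss)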

@HristoBuyukliev
Author

@ymodak
As I also said to @amahendrakar:

  1. This seems like a nice feature, but it is not relevant to my problem.
  2. I tried it anyway; it did not work.

@taborda11 Thank you for your suggestion; unfortunately it did not work.
@EKami Yes, I figured by now that there is no solution. Thank you for your suggestion, I will try it out.

@taborda11 @EKami a teammate of mine found a hacky solution that kind of works:

for i in $(sudo lsof /dev/nvidia2 | grep python | awk '{print $2}' | sort -u); do sudo kill -9 $i; done

This gets all the Python processes that are using GPU 2 (in my case) and kills them. It works, but it is very ugly, and I was hoping for a better way.

@ymodak ymodak added type:bug Bug and removed stat:awaiting response Status - Awaiting response from author type:support Support issues labels Feb 7, 2020
@sanjoy
Contributor
sanjoy commented Feb 7, 2020

However, the only way I can then release the GPU memory is to restart my computer.

How do you exit the TF processes?

This looks like an issue with nvidia-smi based on your last comment. If lsof /dev/nvidia2 can find the processes using the GPU then nvidia-smi should find them as well.

@ymodak ymodak removed their assignment Feb 8, 2020
@HristoBuyukliev
Author

@sanjoy I think that nvidia-smi does not list GPU processes when used within Docker (as in my case)

@sanjoy
Contributor
sanjoy commented Feb 16, 2020

@sanjoy I think that nvidia-smi does not list GPU processes when used within Docker (as in my case)

I see, thanks!

How do you exit the TF processes? Based on what you've said so far, it looks like the TF processes are not dying, and the workaround is to find them via lsof /dev/nvidia2 and kill -9 them manually. So there may be something wrong with how they are being stopped normally.

@mminervini

I have also been battling with the issue of releasing GPU memory for quite some time...

My use case is a machine in a production environment with a single Python process that has to serve different types of clients, and I need to switch models depending on the service to be provided. Thus, purging previous models from memory is mandatory in this case; otherwise resource-exhausted errors appear sooner rather than later.
With TF 1.x and Keras (when it was separate from TF) I managed to make this work with keras.backend.clear_session(). At some point I decided to finally move to TF 2.x with Keras integrated into TF (tf.keras), and then clearing GPU memory apparently became an impossible thing to do! I got the impression that something broke in TF memory management when Keras was integrated into TF.

I tried all combinations of tf.keras.backend.clear_session(), tf.compat.v1.reset_default_graph(), gc.collect(), closing the session, tf.compat.v1.disable_eager_execution(), and other solutions that I found online, but none of these really solved the issue.

As a last resort, I will try the solution proposed by @EKami to spawn a subprocess every time I need to switch models, and I will report on how it goes.
In any case, this introduces inter-process communication and complicates things unnecessarily, so I really hope the TF team will improve GPU memory management and offer a function that really clears the session!
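
To sketch what I have in mind (the model paths, queues, and sentinel protocol here are hypothetical, just to illustrate the shape of the workaround):

from multiprocessing import get_context

def serve_model(model_path, requests, responses):
    # The worker owns the GPU: it loads one model and answers requests until told to stop
    import tensorflow as tf
    model = tf.keras.models.load_model(model_path)
    while True:
        batch = requests.get()
        if batch is None:                      # sentinel: shut this worker down
            break
        responses.put(model.predict(batch).tolist())

if __name__ == '__main__':
    ctx = get_context('spawn')
    requests, responses = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=serve_model, args=('model_a.h5', requests, responses))
    worker.start()
    # ... push inference requests, read predictions ...
    requests.put(None)                         # time to switch models:
    worker.join()                              # the exiting worker frees its GPU memory
    # start a new worker with the next model here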

@mminervini

Replying to my own comment...

I implemented the solution based on spawning a subprocess to run the TensorFlow code and, as expected, it actually works, because all resources (in particular GPU memory) are released once the subprocess is destroyed.

Of course, there are some drawbacks in terms of implementation complexity, since one has to deal with multiprocessing-related concerns that otherwise would not be needed, such as inter-process communication or logging from multiple processes.
Performance is also significantly affected, since every subprocess has to import TF and other modules and load the models onto the GPU again. So this is definitely not suited for time-critical operations.

@phiwei
phiwei commented May 20, 2020

@EKami @mminervini
I have been struggling with this issue for an amount of time that is way beyond reasonable at this point as well... PyTorch already had working solutions for this two years ago, but I am stuck with TF for now. If you can make the switch, I can warmly recommend it.

Anyhow, could you point me to a good example / tutorial for the subprocess approach if you know of any?

@EKami
EKami commented May 20, 2020

@phiwei Yep, I came to the exact same conclusion, but it's hard to move big projects that rely so heavily on TF/Keras to PyTorch. For my future projects I won't make the same mistake, though, and I can clearly see from the trends in papers that that's where everyone is heading. Even the argument that "TF is better suited for production" doesn't hold anymore; in fact, we are shooting ourselves in the foot with bugs like this one which, even after many years, are still not fixed.

The future is JAX/Pytorch, TF is doomed to be a relic of the past at this rate.

As for the subprocess tutorial, I don't have any to share but the small example I gave here: #36465 (comment)

The bad news is: it seems that this solution doesn't work with TF 2.2 on RTX cards (yet another problem). It works well with RTX cards on TF 1.15.x and with non-RTX cards on TF 2.2 (like the NVIDIA T4). It seems to be driver related, so maybe the issue will go away with the next driver release for RTX... no idea, we'll see, but at this point I don't expect much.

@phiwei
phiwei commented May 20, 2020

@EKami Thanks for the warning, I am in fact using TF 2.2 with an RTX card... I worked with PyTorch a lot two years ago and, in my opinion, it was already a very mature tool that actually behaves the way you would want Python code to behave. Something I found very neat was that they use dicts for batches by default; I loved that for customizing models and passing information through them.

Edit:
On that note though, Keras-Tuner works for me with RTX and TF 2.2, so there must be some way to accomplish this.

@nalane
nalane commented May 2, 2023

Has there been any movement on this?

@AmosDinh

Hi @HristoBuyukliev, this is a very old issue that everyone is facing in TF 1.x as well as TF 2.x. It seems to be a design flaw, and the TF team doesn't seem to care about fixing it (I have been facing this issue for more than 2 years now).
What worked well for me was just to run my train/eval in a separate process and wait for it to finish. When the process finishes, the system kills it and releases the GPU resources automatically. You can achieve this by doing something like:

import multiprocessing

process_eval = multiprocessing.Process(target=evaluate, args=(...))
process_eval.start()
process_eval.join()

Here's how to get the return value of the process:

from multiprocessing import Process, Queue
import random

def my_func(arg, q):
    # Compute a value in the child process and hand it back to the parent via the queue
    ret_val = random.random()
    print(ret_val, type(ret_val))
    q.put(ret_val)
    return 1  # the plain return value is discarded; only the queue is visible to the parent


if __name__ == '__main__':
    queue = Queue()  # created in the parent and shared with the child
    p1 = Process(target=my_func, args=('not used', queue))
    p1.start()
    p1.join()

    res = queue.get()  # read the child's result after it has exited
    print(res, type(res))

Does this work with gpu?

@nalane
nalane commented Jun 29, 2023

Does this work with gpu?

Yeah, it does

@exgphe
exgphe commented Jul 12, 2023

I can confirm that in CPU mode TensorFlow does not release the memory either. I created a small experiment that builds a small model and then executes del model, tf.keras.backend.clear_session(), and gc.collect() in a for loop. The memory usage grows steadily and ends up occupying all of your RAM. The only way to release the memory is to create the model in a separate process and then kill the process.
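
For reference, a rough reconstruction of the experiment (the layer sizes and iteration count are mine; watch the process RSS, e.g. in top, while it runs):

import gc
import tensorflow as tf

for i in range(200):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(32,), activation='relu'),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss='mse', optimizer='adam')
    del model
    tf.keras.backend.clear_session()   # the cleanup calls described above
    gc.collect()                       # ... and yet host memory keeps creeping up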

@nalane
nalane commented Aug 14, 2023

@reedwm @sachinprasadhs @mohantym I don't want to come across as rude, but can we get a confirmation from you guys that you at least recognize this as an issue? We have provided graphs and data showing that TensorFlow has a memory leak, and yet every time someone comes along with a "solution" that doesn't even fix the problem at hand, someone from Google tries to close this thread. From my perspective, it's as if Google is trying to say one of two things:

  1. "We don't even care that people are telling us there is a memory leak," or
  2. "We know it's a problem and are trying to sweep it under the rug."

@fingoldo
fingoldo commented Aug 23, 2023

It's just unbelievable to see this in the year 2023. And you guys (core TensorFlow devs) call it a production-ready machine learning framework? Learn how to clean up your crap (claimed GPU memory) first. It's disrespectful to your users. I am currently in a situation where I need to train many models (not only ANNs, but also boostings etc., some of them also using the GPU) on the same fold of data, then issue predictions with each model over a range of separate big files, proceed to the next fold, and so on. Spawning a separate process for every such training and every prediction task, importing TF, and communicating results back to the main process seems much sillier than just calling some helper function to clear the allocated GPU memory. If you are able to allocate memory, then you must be able to release it just as easily. Find the courage to listen to your users and do it.

@Sam-Seaberry

Not a fix, but I use a Docker container that I pull up each time I need to evaluate a model; once the evaluation is complete, I break from the main program so the Docker container is stopped. This releases all GPU resources.

Use this image to enable GPU processing in your container (or a similar image with the correct CUDA version for your GPU):
nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04

The container can be started with a subprocess call like:
subprocess.run(['docker', 'start', container_name], stderr=subprocess.PIPE)

However, to enable GPU processing, the container has to be created with the argument:
--gpus all
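
A minimal sketch of that pattern using a one-shot docker run instead of start/stop (the mount path and entry point are placeholders, and it assumes the image has Python and your dependencies installed, which in practice means building your own image on top of the CUDA one):

import subprocess

# One-shot evaluation: the container exits with the script and --rm removes it,
# so the driver reclaims all of the GPU memory the run used.
subprocess.run([
    'docker', 'run', '--rm', '--gpus', 'all',
    '-v', '/path/to/project:/workspace',                 # placeholder mount
    'nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu20.04',
    'python3', '/workspace/evaluate.py',                 # placeholder entry point
], check=True)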

@nalane
nalane commented Aug 25, 2023

@reedwm @sachinprasadhs Do you hear how ridiculous the above is? THAT is what this memory leak is making us do.

@jurahul jurahul assigned rohan100jain and unassigned reedwm and sachinprasadhs Aug 25, 2023
@jurahul
Contributor
jurahul commented Aug 25, 2023

I've escalated this to the TF team.

@robotoil
robotoil commented Aug 25, 2023 via email

@nalane
nalane commented Aug 25, 2023

@robotoil Whenever someone tells me, “I’m going to start a new project in TensorFlow,” I point them to this exact issue and tell them to use PyTorch instead.

@robotoil
robotoil commented Aug 25, 2023 via email

@xucian
xucian commented Sep 17, 2023

What a funny thing, to have this persist for 3.5 years now. For these years-old issues (a lot of companies have them), the cause is usually something we're not seeing (usually something that would be detrimental to them if they fixed it).
Or maybe in this case they think that just starting Python in a subprocess is an acceptable solution (for me, it was at the time).

@EKami
EKami commented Sep 17, 2023

My first fix was to run in a subprocess. My second fix, which was more painful, was to migrate the models to ONNX and, from that point on, only use PyTorch for future models.

@nalane
nalane commented Sep 17, 2023

the cause is usually something we're not seeing (usually something that would be detrimental to them if they fixed it).

The fact that they haven’t even acknowledged it as an issue, despite the fact that we have supplied code and charts demonstrating that it is one, makes me think that this is the case.

@cjmcclellan

This issue seems to have been fixed by setting TF_GPU_ALLOCATOR=cuda_malloc_async as mentioned in #48545, which seems to work for me on TF 2.15 and CUDA 12.3. I'm not sure why this hasn't been mentioned here yet.
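
For anyone who wants to try it, the variable has to be in the environment before TensorFlow initializes the GPU; one way to do that from Python (exporting it in the shell works just as well):

import os

# Must be set before the GPU is initialized, i.e. before importing TensorFlow
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))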

@nalane
nalane commented Jan 17, 2024

@cjmcclellan From #36465 (comment) in this thread:

I ran this on TF 2.8.0 on both Windows and Linux, on 3 different GPUs (with different CUDA versions), and with both the BFC memory manager and the experimental async allocator (TF_GPU_ALLOCATOR=cuda_malloc_async); results are consistent across all variations I've tried.

@andreped

This issue seems to have been fixed by setting TF_GPU_ALLOCATOR=cuda_malloc_async as mentioned in #48545, which seems to work for me on TF 2.15 and CUDA 12.3. I'm not sure why this hasn't been mentioned here yet.

@cjmcclellan I think most of us have given up at this point :P

As @nalane mentions, if you think this fixes the issue, demonstrate it by running @FirefoxMetzger's benchmark from earlier in this thread with the proposed fix (#36465 (comment)).

@cjmcclellan

@andreped and @nalane, thanks, and sorry I missed that comment. The async allocator fixed my issue, but clearly it did not fix the memory leak entirely.

@ElrondL
ElrondL commented Apr 25, 2024

Yeah, TF team, please fix this. While the multiprocessing trick works for a single GPU, how does that work if you run distributed training? Wouldn't multiprocessing be more likely to mess up the distributed training? If you run distributed training plus hyperparameter optimization, you would be looking at your code eating through hundreds of GBs of RAM.

@takeyama0
takeyama0 commented May 31, 2024

In my environment, specifying a distribution strategy mitigates the problem.
For example, in the single-GPU case, the following did not completely eliminate the memory leak, but it reduced it significantly.

import tensorflow as tf

# set strategy
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

# fit and evaluate under the specified strategy
with strategy.scope():
    dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1,)])
    model.compile(loss='mse', optimizer='adam')
    model.fit(dataset, epochs=10)
    model.evaluate(dataset)
