How can I clear GPU memory in tensorflow 2? #36465
@amahendrakar Hi, this is not what I am looking for. Not using up all the memory at once sounds like a useful feature, however I am looking to clear the memory TF has already taken. I just tried it out, and it doesn't help. I am iteratively increasing the batch size, trying to find the biggest one I can use. Once the Jupyter kernel crashes, the memory stays taken up. Additionally, even the advertised functionality does not work: I made a model with half as many parameters, and TensorFlow still took up 31 of the 32 gigabytes. |
Hello @HristoBuyukliev, I had a similar problem when I was iterating over model.predict(). If you are iteratively increasing the batch size, after each run at a given batch_size try |
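The exact snippet isn't quoted in this thread, but a common form of this suggestion (a minimal, self-contained sketch; as later comments note, it often does not actually hand GPU memory back to the system) is to clear the Keras session and force garbage collection between runs:

```python
import gc

import numpy as np
import tensorflow as tf

x_train = np.random.rand(1024, 32).astype("float32")
y_train = np.random.rand(1024, 1).astype("float32")

for batch_size in (32, 64, 128, 256):
    model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    model.compile(loss="mse", optimizer="adam")
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    # Drop the Python reference and reset Keras' global state between runs.
    del model
    tf.keras.backend.clear_session()
    gc.collect()
```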
You may try limiting GPU memory growth in this case:

```python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)

# your code
```
Hi @HristoBuyukliev, this is a very old issue that everyone faces in TF 1.x as well as TF 2.x. It seems to be a design flaw, and the TF team doesn't seem to care about fixing it (I have been facing this issue for more than 2 years now). What worked well for me was simply to run my train/eval in a separate process and wait for it to finish: when the process exits, the system reclaims it and releases the GPU resources automatically.
|
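For illustration, a minimal sketch of that separate-process approach, using a toy Keras model; the only point is that TensorFlow is imported inside the child process, so its CUDA context and GPU memory go away when the process exits:

```python
import multiprocessing as mp

def train(batch_size):
    # TensorFlow is imported only inside the child process, so the CUDA
    # context (and all GPU memory) is released when the process exits.
    import numpy as np
    import tensorflow as tf
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(loss="mse", optimizer="adam")
    model.fit(np.random.rand(256, 8), np.random.rand(256, 1),
              batch_size=batch_size, epochs=1, verbose=0)

if __name__ == "__main__":
    # "spawn" avoids inheriting any CUDA state from the parent via fork.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=train, args=(64,))
    p.start()
    p.join()   # once the child exits, nvidia-smi shows the memory freed
```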
@ymodak
@taborda11 Thank you for your suggestion; unfortunately, it did not work. @taborda11 @EKami A teammate of mine found a hacky solution that kind of works:
This gets all the Python processes that are using GPU 2 in my case and kills them. It works, but it is very, very ugly, and I was hoping for a better way. |
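The command itself isn't quoted above, but the shape of the workaround being described (a sketch only, assuming nvidia-smi is on the PATH and that killing every Python compute process is acceptable; the original filtered to a single GPU) is roughly:

```python
import os
import signal
import subprocess

# List every compute process the driver knows about (the comment above
# additionally filtered to a single GPU, e.g. GPU 2).
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,process_name",
     "--format=csv,noheader"],
    text=True,
)
for line in out.strip().splitlines():
    pid, name = (field.strip() for field in line.split(",", 1))
    if "python" in name.lower():
        os.kill(int(pid), signal.SIGKILL)   # as blunt as it sounds
```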
How do you exit the TF processes? This looks like an issue with |
@sanjoy I think that nvidia-smi does not list GPU processes when used within Docker (as in my case) |
I see, thanks! How do you exit the TF processes? Based on what you've said so far, it looks like the TF processes are not dying, and the workaround is to find them via nvidia-smi and kill them manually. |
I have also been battling with the issue of releasing GPU memory for quite some time... My use case is a machine in a production environment with a single Python process that has to serve different types of clients, and I need to switch models depending on the service to be provided. Purging previous models from memory is therefore mandatory in this case, otherwise resource-exhausted errors appear sooner rather than later. I tried all combinations of the fixes suggested so far, without success. As a last resort, I will try the solution proposed by @EKami to spawn a subprocess every time I need to switch models, and I will report on how it goes. |
Replying to my own comment... I implemented the solution based on spawning a subprocess to run Tensorflow code and (as expected) it actually works, because all resources (particularly GPU memory) are released once the subprocess is destroyed. Of course, there are some drawbacks in terms of implementation complexity, since one has to deal with multi-processing related stuff that otherwise would not be needed, such as inter-process communication or logging from multiple processes. |
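Along the same lines, a rough sketch of isolating each prediction in its own process and returning the result over a queue (model_path and samples are placeholders, not code from this thread):

```python
import multiprocessing as mp

def predict_worker(model_path, samples, result_queue):
    # TF (and its GPU memory) exists only inside this worker process.
    import tensorflow as tf
    model = tf.keras.models.load_model(model_path)
    result_queue.put(model.predict(samples, verbose=0).tolist())

def predict_with(model_path, samples):
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=predict_worker, args=(model_path, samples, queue))
    worker.start()
    result = queue.get()   # read before join() so a large result can't block the pipe
    worker.join()          # GPU memory is released here, ready for the next model
    return result
```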
@EKami @mminervini Anyhow, could you point me to a good example / tutorial for the subprocess approach if you know of any? |
@phiwei Yep, I came to the exact same conclusion, but it's hard to move big projects which rely so heavily on TF/Keras to PyTorch. For my future projects I won't make the same mistake, though, and I can clearly see from the trends in papers that this is where everyone is heading. Even the argument that "TF is better suited for production" doesn't hold anymore; in fact, we are shooting ourselves in the foot with bugs like this one, which even after many years are still not fixed. The future is JAX/PyTorch; at this rate, TF is doomed to become a relic of the past. As for the subprocess tutorial, I don't have any to share beyond the small example I gave here: #36465 (comment) The bad news is: it seems that this solution doesn't work with TF 2.2 on RTX cards (yet another problem). It works well with RTX cards on TF 1.15.x and with non-RTX cards on TF 2.2 (like the NVIDIA T4). It seems to be driver-related, so maybe with the next driver release for RTX the issue will go away... no idea, we'll see, but at this point I don't expect much. |
@EKami Thanks for the warning, I am in fact using TF 2.2 with an RTX card... I worked with PyTorch a lot two years ago and, in my opinion, it was already a very mature tool that actually behaves the way you would want Python code to behave. Something I found very neat was that they use dicts for batches by default; I loved that for customising models and passing information through them. |
Has there been any movement on this? |
Does this work with gpu? |
Yeah, it does |
I can confirm that in CPU mode TensorFlow does not release the memory either. I created a small experiment that creates a small model, and then executes |
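The rest of that experiment isn't quoted here, but a minimal reproduction of the idea might look like the following, assuming psutil is available to read the process' resident memory between iterations:

```python
import gc
import os

import numpy as np
import psutil           # assumption: psutil is installed, used only for measuring RSS
import tensorflow as tf

proc = psutil.Process(os.getpid())
x = np.random.rand(256, 32).astype("float32")

for i in range(20):
    model = tf.keras.Sequential([tf.keras.layers.Dense(64), tf.keras.layers.Dense(1)])
    model.predict(x, verbose=0)
    del model
    tf.keras.backend.clear_session()
    gc.collect()
    # If memory were released properly, RSS would stay flat across iterations.
    print(f"iteration {i:2d}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB")
```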
@reedwm @sachinprasadhs @mohantym I don't want to come across as rude, but can we get a confirmation from you guys that you at least recognize this as an issue? We have provided graphs and data showing that TensorFlow has a memory leak, and yet every time someone comes along with a "solution" that doesn't even fix the problem at hand, someone from Google tries to close this thread. From my perspective, it's as if Google is trying to say one of two things:
|
It's just unbelievable to see this in the year 2023. And you guys (core TensorFlow devs) call this a production-ready machine learning framework? Learn how to clean up your crap (claimed GPU memory) first. It's disrespectful to your users. I am currently in a situation where I need to train many models (not only ANNs, but also boosting models etc., some of them also using the GPU) on the same fold of data, then issue predictions with each model over a range of separate big files, proceed to the next fold, and so on. Spawning a separate process for every such training and every prediction task, importing TF there, and communicating results back to the main process seems much sillier than just calling some helper function to clear allocated GPU memory. If you are able to allocate memory, then you must be able to release it easily as well. Find the courage to listen to your users and do it. |
Not a fix, but I use a Docker container that I spin up each time I need to evaluate a model; once the evaluation is complete, I break out of the main program so the Docker container is stopped. This releases all GPU resources. Use this container to enable GPU processing in your container (or a similar container with the correct CUDA version for your GPU): The container can be pulled up with a subprocess call like: However, to enable GPU processing within the container, add the arguments: |
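The image name and commands aren't quoted above, but the shape of this workaround is roughly the following; the image, mount path, and script are placeholders, and --gpus requires the NVIDIA Container Toolkit on the host:

```python
import subprocess

# Start a throwaway container for one evaluation run; because of "--rm" and the
# script exiting at the end, all GPU resources are released when it stops.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--gpus", "all",                      # expose the host GPUs to the container
        "-v", "/path/to/project:/workspace",  # placeholder bind mount
        "my-tf-cuda-image:latest",            # placeholder image with TF + CUDA
        "python", "/workspace/evaluate.py",   # placeholder evaluation script
    ],
    check=True,
)
```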
@reedwm @sachinprasadhs Do you hear how ridiculous the above is? THAT is what this memory leak is making us do. |
I've escalated this to the TF team. |
I have been watching this thread for over a year. This is some basic stuff for devs to worry about, instead of deprecating functions every release. That no one seems to think this is a priority makes me question the viability of TensorFlow in general. Probably time to take another look at PyTorch.
|
@robotoil Whenever someone tells me, “I’m going to start a new project in TensorFlow,” I point them to this exact issue and tell them to use PyTorch instead |
I don’t think the devs have _any_ idea how much time is wasted by users trying to come up with workarounds. Imagine if you had to reboot your phone after each call.
|
What a funny thing to have this persist for 3.5 years now. For these years-old issues (a lot of companies have them), the cause is usually something we're not seeing (usually something detrimental to them if they fix it). |
My first fix was to run in a subprocess. My second fix, which was more painful, was to migrate the models to ONNX and, from that point on, to use only PyTorch for future models. |
The fact that they haven't even acknowledged this as an issue, despite the code and charts we have supplied demonstrating it, makes me think that this is the case. |
This issue seems to have been fixed by setting |
@cjmcclellan From #36465 (comment) in this thread:
|
@cjmcclellan I think most of us have given up at this point :P As @nalane mentions, if you think this fixes the issue, demonstrate it by running @FirefoxMetzger's benchmark from earlier in this thread with the proposed fix (#36465 (comment)). |
Yeah, TF team, please fix this. While the multiprocess trick works for a single GPU, how does that work if you run distributed training? Wouldn't multiprocessing be more likely to mess up distributed training? If you run distributed training plus hyperparameter optimization, you would be looking at your code eating through hundreds of GB of RAM. |
In my environment, specifying a strategy mitigates the problem:

```python
import tensorflow as tf

# set strategy
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

# fit and evaluate under the specified strategy
with strategy.scope():
    dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
    model = tf.keras.Sequential([tf.keras.layers.Dense(1,)])
    model.compile(loss='mse', optimizer='adam')
    model.fit(dataset, epochs=10)
    model.evaluate(dataset)
```
System information
I created a model, nothing especially fancy in it. When I create the model and run nvidia-smi, I can see that TensorFlow takes up nearly all of the memory. When I fit the model with a small batch size, it runs successfully. When I fit with a larger batch size, it runs out of memory. Nothing unexpected so far.
However, the only way I can then release the GPU memory is to restart my computer. When I run nvidia-smi I can see the memory is still used, but there is no process using a GPU. Also, if I try to run another model, it fails much sooner.
Nothing in the first five pages of Google results works (and most solutions are for TF 1).
Is there any way to release GPU memory in TensorFlow 2?