Allow full deallocation of GPU memory - like in Catboost #19571
For fellow TF users, here I illustrate (using the sample code provided above) the two currently available methods to minimize and limit GPU memory usage, which can be combined; see the sketch below.
Note: to push Python's GPU memory usage to amounts visible in nvidia-smi (larger than the default minimum of roughly 200 MiB used by most frameworks), the number of samples in the example above needs to be increased to at least 20k.
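A minimal sketch of those two options in the TF 1.x Python API (the fraction 0.3 below is an arbitrary illustration, not a recommendation):

```python
import tensorflow as tf

config = tf.ConfigProto()

# Method 1: allocate GPU memory on demand instead of preallocating
# nearly all of it up front.
config.gpu_options.allow_growth = True

# Method 2: hard-cap the process at a fixed fraction of total GPU
# memory (here an arbitrary 30%); both options can be combined.
config.gpu_options.per_process_gpu_memory_fraction = 0.3

sess = tf.Session(config=config)
```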
I'm not sure whether to categorize this as a duplicate of #15880 or an independent feature request. My best understanding is that you simply want a lightweight solution for shrinking the GPU memory allocation without completely killing the TF process. Towards that end, I think @zheng-xq's suggestion about extending Session.reset() sounds most promising. I don't understand how catboost provides a model solution; I'm not familiar with that program.
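For reference, a sketch of the existing Session.reset() API that suggestion would extend (the target and container name below are illustrative placeholders):

```python
import tensorflow as tf

# tf.Session.reset() clears resource containers (which hold Variables and
# similar persistent state) on a given target, but it does not shrink the
# allocator's memory pool itself; that gap is what "extending
# Session.reset()" refers to.
tf.Session.reset(target="grpc://localhost:2222", containers=["experiment0"])
```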
It is a duplicate with a twist. I'm not calling for lower utilization of the GPU: it is perfectly fine to claim all of it, as long as there is an option to limit the usage to a certain fraction (per_process_gpu_memory_fraction). What's useful in catboost's approach is the idea of deallocating GPU memory after returning results from the GPU back to the CPU.

BTW, you already have something in common with catboost: you are the only two GPU-enabled frameworks (among the 12 or so I've tested for our containers at work) that give the end user this kind of control over GPU memory usage, something that even Docker does not let us do. I've noticed that it is customary among deep learning frameworks to create such GPU memory leaks (which is what this behavior amounts to): none of them releases GPU memory, requiring users to restart the Python kernel to reclaim it. Incidentally, none apart from Tensorflow claims nearly all of it, but that's what's so special about you :). In contrast to DNN frameworks, GBDT frameworks tend to release what they no longer need; that applies not only to catboost but also to lightgbm.

IMO it would be good to contact catboost's GPU guy, Vasily (Noxomo) from the Yandex team, because catboost manages not only to allocate 95% of GPU memory but also to use 100% of GPU compute power (despite the sequential nature of boosting itself), which I've seen in only one DNN framework: the sadly neglected Theano...
As pointed out in the #15880 thread, it's not plausible to immediately and automatically release GPU memory after GPU ops complete, because TF has persistent state objects (Variables) that live in GPU memory and are not restricted to any special sub-region.
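A minimal sketch of the constraint being described (shapes and sizes below are arbitrary illustrations):

```python
import tensorflow as tf

# Variables are persistent state: the buffer backing `v` lives in GPU
# memory for the lifetime of the session, not just for one run() call,
# so the allocator cannot hand it back when an op finishes.
v = tf.Variable(tf.random_normal([8192, 8192]))  # ~256 MiB at float32
step = v.assign_add(tf.ones([8192, 8192]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Between these calls the buffer must stay allocated, because the
    # next run() reads and updates `v` in place.
    sess.run(step)
    sess.run(step)
```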
But there is certainly excessive greediness in Tensorflow's memory allocation: it claims all memory on all available GPUs, yet by default it only uses one, the one with the lowest ID.
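A common workaround (a sketch, not an endorsed fix) is to hide the other GPUs before TF initializes CUDA, or to expose only one device per session:

```python
import os

# Must be set before TensorFlow initializes CUDA; otherwise TF maps
# (and by default preallocates) memory on every visible GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# In-process equivalent: expose only GPU 0 to this session.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = "0"
sess = tf.Session(config=config)
```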
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no (slightly modified)
- OS Platform and Distribution: Ubuntu 16.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version: tensorflow-gpu 1.8.0
- Python version: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: 9.0.176 / 7.1.2
- GPU model and memory: GPU[0] GeForce GTX 1080 Ti
Describe the problem
Same issue as #15880 here, with a fully reproducible example using the latest TF 1.8 with CUDA 9.0 and cuDNN 7.1 on Ubuntu 16.04. So it's the same old story, but this time I'm pointing you to a model solution for GPU memory management: the Catboost library by Yandex.
I confirm that Tensorflow does not release GPU memory after preallocating most of the available VRAM (leaving only a few percent free). This memory should be freed by TF immediately after use so that other modeling frameworks can use it, so this is a bug that needs to be fixed. To see how it can be done, refer to how Catboost manages GPU resources; a sketch of its workflow follows below.
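For contrast, a hedged sketch of that workflow using CatBoost's Python API (the toy data and parameter values below are placeholders):

```python
import numpy as np
from catboost import CatBoostClassifier

# Toy placeholder data; any real dataset behaves the same way.
X_train = np.random.rand(1000, 10)
y_train = np.random.randint(0, 2, size=1000)

# task_type="GPU" trains on the GPU; devices selects which card to use.
model = CatBoostClassifier(iterations=100, task_type="GPU", devices="0")
model.fit(X_train, y_train)

# Once fit() returns the results to the CPU side, CatBoost releases its
# GPU allocations; nvidia-smi shows the memory freed without restarting
# the Python kernel.
```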
If you are using Jupyter Notebook, restarting the Python kernel is relatively easy and it "solves" the problem, but of course at the cost of losing all the data you have loaded into CPU memory.