Allow full deallocation of GPU memory - like in Catboost #19571

Closed
mirekphd opened this issue May 26, 2018 · 10 comments
Labels: type:feature (Feature requests)

Comments

@mirekphd

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    no (slightly modified)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 16.04
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    tensorflow-gpu 1.8.0
  • Python version:
3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
    [GCC 7.2.0]
  • Bazel version (if compiling from source):
    N/A
  • GCC/Compiler version (if compiling from source):
    N/A
  • CUDA/cuDNN version:
    9.0.176 / 7.1.2
  • GPU model and memory:
    GPU[0] GeForce GTX 1080 Ti
  • Exact command to reproduce:
import tensorflow as tf
import numpy
import matplotlib.pyplot as plt
import sklearn.datasets.samples_generator as sample_gen
rng = numpy.random

# Parameters
learning_rate = 0.01
training_epochs = 500
display_step = 50

# Training set
n_features=1
n_samples=100

# Training Data
# train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
#                          7.042,10.791,5.313,7.997,5.654,9.27,3.1])
# train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
#                          2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X, train_Y = sample_gen.make_regression(n_samples=n_samples, n_features=n_features, random_state=0)
# n_samples = train_X.shape[0]

# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session() as sess:
    sess.run(init)

    # Fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})

        #Display logs per epoch step
        if (epoch+1) % display_step == 0:
            c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), 
                "W=", sess.run(W), "b=", sess.run(b))

    print("Optimization Finished!")
    training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
    print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
    
    #Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()

Describe the problem

This is the same issue as #15880, but with a fully reproducible example using the latest TF 1.8 with CUDA 9.0 and cuDNN 7.1 on Ubuntu 16.04. Same old story, then, but this time I'm pointing you to a model solution for GPU memory management: the Catboost library by Yandex.

I confirm that TensorFlow preallocates most of the available VRAM (leaving only a few percent free) and does not release it afterwards. This memory should be freed by TF as soon as it is no longer needed, so that other modeling frameworks can use it; as it stands, this is a bug that needs fixing. For an example of how it can be done, see how Catboost manages GPU resources.

If you are using Jupyter Notebook, restarting the Python kernel is relatively easy and it "solves" the problem, but of course at the cost of losing all the data you have loaded into CPU memory.
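
A partial workaround, sketched below with standard Python multiprocessing rather than any TF API, is to run the session in a child process: all GPU memory is returned to the driver when that process exits (the tiny constant graph is only a stand-in for real training code).

import multiprocessing as mp

def train_job(queue):
    # Import TF inside the child so that CUDA is initialized only in this process.
    import tensorflow as tf
    a = tf.constant(2.0)
    b = tf.constant(3.0)
    with tf.Session() as sess:
        queue.put(sess.run(a * b))    # send results back to the parent

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=train_job, args=(q,))
    p.start()
    result = q.get()
    p.join()                          # GPU memory is released when the child exits
    print("result from child process:", result)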

@mirekphd (Author) commented May 27, 2018

For fellow TF users, here I illustrate, using the sample code provided above, the two currently available methods of minimizing and limiting GPU memory usage (they can be combined):

  • allow_growth and
  • per_process_gpu_memory_fraction:
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Limit GPU memory use
cfg = tf.ConfigProto()
# allow dynamic GPU memory allocation (increased as needed,
# but not released once allocated - requiring python kernel restart)
cfg.gpu_options.allow_growth=True
# define the hard limit on GPU memory allocation
# (caution: can cause OOM errors)
cfg.gpu_options.per_process_gpu_memory_fraction = 0.20

# Start training (notice we used custom config)
with tf.Session(config=cfg) as sess:
    sess.run(init)
    [..]

Note: to push Python's GPU memory usage to amounts clearly visible in nvidia-smi (above the roughly 200 MiB baseline that most frameworks allocate), we need to increase the number of samples in the example above to at least 20k:

n_samples=20000
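
The allocation can also be inspected from inside Python rather than via nvidia-smi; the sketch below uses tf.contrib.memory_stats, assuming that contrib module is present in this 1.x build:

# Sketch: query the GPU allocator from inside TF instead of via nvidia-smi.
import tensorflow as tf
from tensorflow.contrib.memory_stats import BytesInUse, BytesLimit

with tf.device("/gpu:0"):
    bytes_in_use = BytesInUse()       # bytes currently held for this device
    bytes_limit = BytesLimit()        # upper bound the allocator may claim

cfg = tf.ConfigProto()
cfg.gpu_options.allow_growth = True

with tf.Session(config=cfg) as sess:
    used, limit = sess.run([bytes_in_use, bytes_limit])
    print("GPU memory in use: %d of %d bytes" % (used, limit))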

@poxvoculi (Contributor)

I'm not sure whether to categorize this as a duplicate of #15880 or an independent feature request. My best understanding is that you simply want a lightweight way to shrink the GPU memory allocation without completely killing the TF process. Towards that end, I think @zheng-xq's suggestion about extending Session.reset() sounds most promising. I don't understand how Catboost provides a model solution; I'm not familiar with that program.
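
For context, a sketch of what tf.Session.reset() already does in TF 1.x: it clears resource containers (and with them Variables) for a given target, although, as this thread discusses, it does not hand the allocator's memory pool back to the driver. The empty-string local target below is an assumption:

import tensorflow as tf

v = tf.Variable(42.0, name="v")       # Variables live in a resource container
init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)
print(sess.run(v))                    # 42.0

# Clears resource containers (dropping v) and closes sessions connected to
# this target; the GPU memory pool itself stays reserved by the process.
tf.Session.reset(target="")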

@poxvoculi added the stat:awaiting response (Status - Awaiting response from author) and type:feature (Feature requests) labels on May 29, 2018
@mirekphd (Author) commented May 29, 2018

It is a duplicate with a twist. I'm not calling for lower GPU utilization; it is perfectly fine to claim all of it, as long as there is an option to limit usage to a certain percentage (per_process_gpu_memory_fraction). What's useful in Catboost's approach is the idea of deallocating GPU memory after the results have been returned from the GPU back to the CPU. By the way, you already have something in common with Catboost: you are the only two GPU-enabled frameworks (among the 12 or so I've tested for our containers at work) that give the end user this kind of control over GPU memory usage, something that even Docker does not allow us to do.

I've noticed that this kind of GPU memory leak (which is what it is) is customary among deep learning frameworks: none of them release GPU memory, forcing users to restart the Python kernel to get it back. Incidentally, none apart from TensorFlow claims nearly all of it, but that's what's so special about you :). In contrast to the DNN frameworks, GBDT frameworks tend to release what they no longer need; that applies not only to Catboost but also to LightGBM.

IMO it would be good to contact Catboost's GPU developer, Vasily (Noxomo) from the Yandex team, because Catboost manages not only to allocate 95% of GPU memory but also to use 100% of GPU compute power (despite the sequential nature of boosting itself), something I've seen in only one DNN framework, the sadly neglected Theano...

@tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Jun 2, 2018
@tensorflowbutler (Member)

Nagging Assignee @poxvoculi: It has been 18 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@poxvoculi (Contributor)

As pointed out in the #15880 thread, it's not plausible to immediately and automatically release GPU memory after GPU ops complete because TF has persistent state objects (Variables) that live in GPU memory and are not restricted to any special sub-region.
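
A small illustration of that point, as a sketch in TF 1.x (allow_soft_placement is only there so the snippet also runs on a CPU-only machine): the Variable below must stay resident in device memory between session.run() calls, so the allocator cannot simply return its pool after each op completes.

import tensorflow as tf

with tf.device("/gpu:0"):
    counter = tf.Variable(0.0, name="counter")     # persistent state on the GPU
    increment = tf.assign_add(counter, 1.0)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        # Each call finishes, but the Variable's buffer must remain valid
        # for the next one, so its memory cannot be freed in between.
        print(sess.run(increment))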

@tensorflowbutler (Member)

Nagging Assignee @poxvoculi: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler (Member)

Nagging Assignee @poxvoculi: It has been 29 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@mirekphd (Author) commented Aug 2, 2018

But there is certainly excessive greediness in TensorFlow's memory allocation: it claims all RAM on all available GPUs, yet by default it uses only one GPU, the one with the lowest ID.
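
In the meantime, TF can at least be kept off the other GPUs by restricting which devices it sees, either with CUDA_VISIBLE_DEVICES before CUDA is initialized or with gpu_options.visible_device_list; a minimal sketch (device ID "0" is just an example):

import os
# Option 1: hide every GPU except one before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Option 2: list the device IDs TF may see in the session config.
cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = "0"
cfg.gpu_options.allow_growth = True       # and avoid up-front preallocation

with tf.Session(config=cfg) as sess:
    print(sess.run(tf.constant("memory is claimed on GPU 0 only")))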

@poxvoculi (Contributor)

@mirekphd (Author) commented Sep 28, 2018

Allocating video memory on all GPUs while using only one GPU's cores looks like a bug that could be addressed (by choosing less greedy defaults). This is particularly important because multi-tenant environments such as OpenShift do not let us limit video RAM or GPU core usage (unlike the standard CPU case), so the current approach allows a single user running on default settings to block all GPU resources.
