Allow full deallocation of GPU memory - like in Catboost #19571

Closed
mirekphd opened this issue May 26, 2018 · 10 comments
Labels: type:feature (Feature requests)

Comments

@mirekphd

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    no (slightly modified)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 16.04
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    tensorflow-gpu 1.8.0
  • Python version:
3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
    [GCC 7.2.0]
  • Bazel version (if compiling from source):
    N/A
  • GCC/Compiler version (if compiling from source):
    N/A
  • CUDA/cuDNN version:
    9.0.176 / 7.1.2
  • GPU model and memory:
    GPU[0] GeForce GTX 1080 Ti
  • Exact command to reproduce:
import tensorflow as tf
import numpy
import matplotlib.pyplot as plt
import sklearn.datasets.samples_generator as sample_gen
rng = numpy.random

# Parameters
learning_rate = 0.01
training_epochs = 500
display_step = 50

# Training set
n_features=1
n_samples=100

# Training Data
# train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
#                          7.042,10.791,5.313,7.997,5.654,9.27,3.1])
# train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
#                          2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X, train_Y = sample_gen.make_regression(n_samples=n_samples, n_features=n_features, random_state=0)
# n_samples = train_X.shape[0]

# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session() as sess:
    sess.run(init)

    # Fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})

        #Display logs per epoch step
        if (epoch+1) % display_step == 0:
            c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), 
                "W=", sess.run(W), "b=", sess.run(b))

    print("Optimization Finished!")
    training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
    print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
    
    #Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()

Describe the problem

This is the same issue as #15880, but with a fully reproducible example using the latest TF 1.8 with CUDA 9.0 and cuDNN 7.1 on Ubuntu 16.04. Same old story, then, but this time I'm pointing you to a model solution for GPU memory management: the Catboost library by Yandex.

I confirm that TensorFlow preallocates most of the available VRAM (leaving only a few percent free) and does not release it afterwards. This memory should be freed by TF as soon as it is no longer needed, so that other modeling frameworks can use it; as it stands, this is a bug that needs fixing. For an example of how it can be done, see how Catboost manages GPU resources.

If you are using Jupyter Notebook, restarting the Python kernel is relatively easy and it "solves" the problem, but of course at the cost of losing all the data you have loaded into CPU memory.
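
A partial workaround, sketched below with standard Python multiprocessing rather than any TF API, is to run the session in a child process: all GPU memory is returned to the driver when that process exits (the tiny constant graph is only a stand-in for real training code).

import multiprocessing as mp

def train_job(queue):
    # Import TF inside the child so that CUDA is initialized only in this process.
    import tensorflow as tf
    a = tf.constant(2.0)
    b = tf.constant(3.0)
    with tf.Session() as sess:
        queue.put(sess.run(a * b))    # send results back to the parent

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=train_job, args=(q,))
    p.start()
    result = q.get()
    p.join()                          # GPU memory is released when the child exits
    print("result from child process:", result)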

@mirekphd (Author) commented May 27, 2018

For fellow TF users, here I illustrate, using the sample code provided above, the two currently available methods of minimizing and limiting GPU memory usage (they can be combined):

  • allow_growth and
  • per_process_gpu_memory_fraction:
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Limit GPU memory use
cfg = tf.ConfigProto()
# allow dynamic GPU memory allocation (increased as needed,
# but not released once allocated - requiring python kernel restart)
cfg.gpu_options.allow_growth=True
# define the hard limit on GPU memory allocation
# (caution: can cause OOM errors)
cfg.gpu_options.per_process_gpu_memory_fraction = 0.20

# Start training (notice we used custom config)
with tf.Session(config=cfg) as sess:
    sess.run(init)
    [..]

Note: to push Python's GPU memory usage to amounts clearly visible in nvidia-smi (above the roughly 200 MiB baseline that most frameworks allocate), we need to increase the number of samples in the example above to at least 20k:

n_samples=20000
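
The allocation can also be inspected from inside Python rather than via nvidia-smi; the sketch below uses tf.contrib.memory_stats, assuming that contrib module is present in this 1.x build:

# Sketch: query the GPU allocator from inside TF instead of via nvidia-smi.
import tensorflow as tf
from tensorflow.contrib.memory_stats import BytesInUse, BytesLimit

with tf.device("/gpu:0"):
    bytes_in_use = BytesInUse()       # bytes currently held for this device
    bytes_limit = BytesLimit()        # upper bound the allocator may claim

cfg = tf.ConfigProto()
cfg.gpu_options.allow_growth = True

with tf.Session(config=cfg) as sess:
    used, limit = sess.run([bytes_in_use, bytes_limit])
    print("GPU memory in use: %d of %d bytes" % (used, limit))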

@poxvoculi (Contributor)

I'm not sure whether to categorize this as a duplicate of #15880 or an independent feature request. My best understanding is that you simply want a lightweight way to shrink the GPU memory allocation without completely killing the TF process. Towards that end, I think @zheng-xq's suggestion about extending Session.reset() sounds most promising. I don't understand how Catboost provides a model solution; I'm not familiar with that program.
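
For context, a sketch of what tf.Session.reset() already does in TF 1.x: it clears resource containers (and with them Variables) for a given target, although, as this thread discusses, it does not hand the allocator's memory pool back to the driver. The empty-string local target below is an assumption:

import tensorflow as tf

v = tf.Variable(42.0, name="v")       # Variables live in a resource container
init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)
print(sess.run(v))                    # 42.0

# Clears resource containers (dropping v) and closes sessions connected to
# this target; the GPU memory pool itself stays reserved by the process.
tf.Session.reset(target="")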

@poxvoculi added the stat:awaiting response (Status - Awaiting response from author) and type:feature (Feature requests) labels on May 29, 2018
@mirekphd (Author) commented May 29, 2018

It is a duplicate with a twist. I'm not calling for lower GPU utilization; it is perfectly fine to claim all of it, as long as there is an option to limit usage to a certain percentage (per_process_gpu_memory_fraction). What's useful in Catboost's approach is the idea of deallocating GPU memory after the results have been returned from the GPU back to the CPU. By the way, you already have something in common with Catboost: you are the only two GPU-enabled frameworks (among the 12 or so I've tested for our containers at work) that give the end user this kind of control over GPU memory usage, something that even Docker does not allow us to do.

I've noticed that this kind of GPU memory leak (which is what it is) is customary among deep learning frameworks: none of them release GPU memory, forcing users to restart the Python kernel to get it back. Incidentally, none apart from TensorFlow claims nearly all of it, but that's what's so special about you :). In contrast to the DNN frameworks, GBDT frameworks tend to release what they no longer need; that applies not only to Catboost but also to LightGBM.

IMO it would be good to contact Catboost's GPU developer, Vasily (Noxomo) from the Yandex team, because Catboost manages not only to allocate 95% of GPU memory but also to use 100% of GPU compute power (despite the sequential nature of boosting itself), something I've seen in only one DNN framework, the sadly neglected Theano...

@tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Jun 2, 2018
@tensorflowbutler (Member)

Nagging Assignee @poxvoculi: It has been 18 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@poxvoculi (Contributor)

As pointed out in the #15880 thread, it's not plausible to immediately and automatically release GPU memory after GPU ops complete because TF has persistent state objects (Variables) that live in GPU memory and are not restricted to any special sub-region.
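
A small illustration of that point, as a sketch in TF 1.x (allow_soft_placement is only there so the snippet also runs on a CPU-only machine): the Variable below must stay resident in device memory between session.run() calls, so the allocator cannot simply return its pool after each op completes.

import tensorflow as tf

with tf.device("/gpu:0"):
    counter = tf.Variable(0.0, name="counter")     # persistent state on the GPU
    increment = tf.assign_add(counter, 1.0)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        # Each call finishes, but the Variable's buffer must remain valid
        # for the next one, so its memory cannot be freed in between.
        print(sess.run(increment))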

@tensorflowbutler (Member)

Nagging Assignee @poxvoculi: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler (Member)

Nagging Assignee @poxvoculi: It has been 29 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@mirekphd (Author) commented Aug 2, 2018

But there is certainly excessive greediness in TensorFlow's memory allocation: it claims all RAM on all available GPUs, yet by default it uses only one GPU, the one with the lowest ID.
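
In the meantime, TF can at least be kept off the other GPUs by restricting which devices it sees, either with CUDA_VISIBLE_DEVICES before CUDA is initialized or with gpu_options.visible_device_list; a minimal sketch (device ID "0" is just an example):

import os
# Option 1: hide every GPU except one before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Option 2: list the device IDs TF may see in the session config.
cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = "0"
cfg.gpu_options.allow_growth = True       # and avoid up-front preallocation

with tf.Session(config=cfg) as sess:
    print(sess.run(tf.constant("memory is claimed on GPU 0 only")))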

@poxvoculi (Contributor)

@mirekphd (Author) commented Sep 28, 2018

Allocating video memory on all GPUs while using only one GPU's cores looks like a bug that could be addressed (by choosing less greedy defaults). This is particularly important because multi-tenant environments such as OpenShift do not let us limit video RAM or GPU core usage (unlike the standard CPU case), so the current approach allows a single user running on default settings to block all GPU resources.
