[Crash fix] Fix cudaMallocAsync crashes. #49173
Conversation
Force-pushed from 91ad81e to 378aaad.
Can you please update the PR description to state what the bug was and how this PR fixes it?
```diff
@@ -85,7 +85,7 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   void SetStream(void* stream) override {
 #if TF_CUDA_MALLOC_ASYNC_SUPPORTED
-    cuda_stream_ = reinterpret_cast<CUstream>(stream);
+    cuda_stream_ = *(reinterpret_cast<CUstream*>(stream));
```
I did not notice this earlier, but a setter for an immutable property seems like a code smell to me. Have you tried passing in the stream via the constructor somehow?
(Does not need to be addressed in this PR.)
Mostly, the GpuDevice needs the gpu_allocator, but the gpu_allocator needs the stream created by the GpuDevice.
Fixing this would require a refactoring that breaks existing APIs, and I wasn't sure whether those APIs are part of the public API or not. As I wanted this to get into TF 2.5, I preferred to stay away from that refactoring.
Relevant code:
tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
Lines 1414 to 1422 in aa08697
```cpp
std::unique_ptr<BaseGPUDevice> gpu_device = CreateGPUDevice(
    options, device_name, static_cast<Bytes>(bytes_limit), dev_locality,
    tf_device_id, GetShortDeviceDescription(platform_device_id, *desc),
    gpu_allocator, ProcessState::singleton()->GetCPUAllocator(numa_node));
LOG(INFO) << "Created device " << device_name << " with "
          << (bytes_limit >> 20) << " MB memory: "
          << " -> " << GetShortDeviceDescription(platform_device_id, *desc);
TF_RETURN_IF_ERROR(gpu_device->Init(options));
gpu_allocator->SetStream(gpu_device->GetStream());
```
(I also updated the PR description.)
(I amended the test commit to remove a commented-out leftover line.)
Force-pushed from 378aaad to 2f856ab.
I updated this PR. The new configuration option is now experimental.
Minor comments.
```diff
@@ -106,6 +110,10 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   CUmemoryPool pool_;
 #endif  // TF_CUDA_MALLOC_ASYNC_SUPPORTED

+  // Just a counter for the number of time this class is instanciated.
```
s/instanciated/instantiated/ everywhere.
Also s/nb/number
Done.
```diff
@@ -106,6 +110,10 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   CUmemoryPool pool_;
 #endif  // TF_CUDA_MALLOC_ASYNC_SUPPORTED

+  // Just a counter for the number of time this class is instanciated.
+  // Only useful for tests.
+  static int nb_instanciated_;
```
Does this need to be an std::atomic?
Implemented. Note that the current TF GpuDevice creates them serially, but having it atomic is more future-proof.
```cpp
#endif
  }

  static int NbInstantiated() {
```
Maybe call this GetInstantiatedCountTestOnly?
Done.
I fixed the new conflict.
Thank you very much!
```cpp
EXPECT_EQ(devices.size(), 1);
Device* device = devices[0].get();
auto* device_info = device->tensorflow_gpu_device_info();
CHECK(device_info);
```
Could you please use DCHECK instead? We don't use CHECK in new code anymore. Sorry we didn't notice this earlier!
This is a test, and I really want a crash even in non-debug mode, so I pushed a change to use an EXPECT instead.
Just to be sure: I see many "Internal CI build failed" messages. Do they block this PR? What is their cause? The Windows error is that Bazel didn't find the "sandboxed" strategy, which to my knowledge has been reverted.
The problem was an unused variable. @penpornk has adjusted your patch internally, then that patch needed approval internally; now it is theoretically ready to land (if the tests pass).
Thanks for the update.
In some of the refactoring of cudaMallocAsync, a deadlock was introduced on OOM. I wasn't sure if you wanted to include the fix in this PR or not. I can make another PR with the trivial fix if you want; just tell me how you want to handle it.
I think landing that separately is fine. As far as I can tell, the problem is a missing BUILD dependency for the new header include in the gpu_device_test. I added this as a review comment on the internal change; hopefully that is the only remaining thing needed to get it landed.
Any update? I think the TF 2.6 cutoff is soon. It would be great to get this included so people can start to use it manually.
@nouiz Apologies for the delay! A couple of tests failed the heapcheck (memory leaks). (You'll have to build them and run the binary with pprof to reproduce. The problem is these tests seem to have trouble building in my open-source environment right now, so I'm not sure if you'll be able to build them.)
Any idea where the leak could come from? I'm sorry I haven't had time to take a closer look at what the PR is doing yet. I'll try to do that today. Will try to get this into 2.6.
I looked at the code, but I do not understand what could leak. I do not know pprof; are there any options I must specify? I'll try to reproduce the error, just in case I can help.
@nouiz Thank you very much for offering to help! :)

```cpp
delete gpu_bfc_allocator;
gpu_bfc_allocator = nullptr;
```

The merge is now stuck because of some (seemingly unrelated) infra issues instead. I'll keep you posted. Just for reference, here is how to use
It is now merged. Thanks for the help. For pprof, the last instruction:

```shell
HEAPCHECK=normal LD_PRELOAD=/usr/lib/libtcmalloc.so TF2_BEHAVIOR=1
```
Thank you very much again for your patience!
Great question. The heap checker is part of TCMalloc, so we only need to link the binary against TCMalloc and turn on the heap checker with the environment variable
This reverts commit a4553f8.
The first commit fixes #48869. The second commit fixes another follow-up crash when using TF_GPU_ALLOCATOR=cuda_malloc_async.
The 2 fixes are:
@sanjoy @chsigg