
[Crash fix] Fix cudaMallocAsync crashes. #49173

Merged

Conversation

nouiz
Contributor
@nouiz nouiz commented May 13, 2021

The first commit fixes #48869. The second commit fixes another follow-up crash when using TF_GPU_ALLOCATOR=cuda_malloc_async.

The two fixes are:

  • The Allocator API makes the statistics optional, but BaseGPUDeviceFactory::CreateGPUDevice() checks that they are available.
  • Bad handling of the pointer to the compute stream: what is passed is a pointer to the compute stream, not the compute stream itself (see the sketch below).
    @sanjoy @chsigg
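
For illustration only, the two issues can be sketched roughly as follows (simplified stand-in types, not the actual TensorFlow code):

    // Minimal sketch of the two fixes, using simplified stand-in types
    // (not the actual TensorFlow classes).
    #include <cstdint>
    #include <optional>

    struct AllocatorStats { int64_t bytes_limit = 0; };

    struct Allocator {
      virtual ~Allocator() = default;
      // Statistics are optional in the Allocator API.
      virtual std::optional<AllocatorStats> GetStats() { return std::nullopt; }
      virtual void SetStream(void* stream) {}
    };

    using CUstream = void*;  // stand-in for the CUDA driver's stream handle

    struct GpuCudaMallocAsyncAllocator : Allocator {
      CUstream cuda_stream_ = nullptr;

      void SetStream(void* stream) override {
        // Fix 2: the caller hands over a pointer *to* the compute stream, so
        // the void* must be cast to CUstream* and dereferenced, not cast
        // directly to CUstream.
        cuda_stream_ = *(reinterpret_cast<CUstream*>(stream));
      }
    };

    // Fix 1: device creation must not assume GetStats() holds a value.
    int64_t BytesLimitOrDefault(Allocator& allocator, int64_t fallback) {
      std::optional<AllocatorStats> stats = allocator.GetStats();
      return stats.has_value() ? stats->bytes_limit : fallback;
    }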

@google-ml-butler google-ml-butler bot added the size:XS CL Change Size: Extra Small label May 13, 2021
@google-cla google-cla bot added the cla: yes label May 13, 2021
@nouiz nouiz force-pushed the upstream-cuda_malloc_async_fixes branch from 91ad81e to 378aaad Compare May 14, 2021 00:15
@gbaned gbaned self-assigned this May 14, 2021
@gbaned gbaned added comp:core issues related to core part of tensorflow prtype:bugfix PR to fix a bug labels May 14, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 14, 2021
@gbaned gbaned requested a review from sanjoy May 14, 2021 06:53
Contributor
@sanjoy sanjoy left a comment

Can you please update the PR description to state what the bug was and how this PR fixes it?

@@ -85,7 +85,7 @@ class GpuCudaMallocAsyncAllocator : public Allocator {

   void SetStream(void* stream) override {
 #if TF_CUDA_MALLOC_ASYNC_SUPPORTED
-    cuda_stream_ = reinterpret_cast<CUstream>(stream);
+    cuda_stream_ = *(reinterpret_cast<CUstream*>(stream));
Contributor

I did not notice this earlier, but a setter for an immutable property seems like a code smell to me. Have you tried passing in the stream via the constructor somehow?

(Does not need to be addressed in this PR.)

Contributor Author

Mostly, the GpuDevice needs the gpu_allocator, but the gpu_allocator needs the stream created by the GpuDevice.
Fixing this would require a refactoring that breaks existing APIs, and I wasn't sure whether those APIs are part of the public API or not. As I wanted this to get into TF 2.5, I preferred to stay away from that refactoring.
Relevant code:

std::unique_ptr<BaseGPUDevice> gpu_device = CreateGPUDevice(
    options, device_name, static_cast<Bytes>(bytes_limit), dev_locality,
    tf_device_id, GetShortDeviceDescription(platform_device_id, *desc),
    gpu_allocator, ProcessState::singleton()->GetCPUAllocator(numa_node));
LOG(INFO) << "Created device " << device_name << " with "
          << (bytes_limit >> 20) << " MB memory: "
          << " -> " << GetShortDeviceDescription(platform_device_id, *desc);
TF_RETURN_IF_ERROR(gpu_device->Init(options));
gpu_allocator->SetStream(gpu_device->GetStream());

(I also updated the PR description.)
(I amended the test commit to remove a commented-out leftover line.)
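
For illustration only, the circular dependency described above boils down to a two-phase initialization along these lines (a minimal sketch with hypothetical names, not the actual TF factory code):

    // Hypothetical sketch of the ordering problem: the allocator is needed to
    // construct the device, but the compute stream only exists after the
    // device's Init(), so the stream has to be injected through a setter.
    #include <memory>

    struct StreamAllocator {
      void* stream_ = nullptr;
      void SetStream(void* stream) { stream_ = stream; }  // phase 2: inject the stream
    };

    struct GpuDevice {
      explicit GpuDevice(StreamAllocator* alloc) : alloc_(alloc) {}
      void Init() { stream_ = this; }  // stand-in for creating the compute stream
      void* GetStream() { return stream_; }
      StreamAllocator* alloc_;
      void* stream_ = nullptr;
    };

    int main() {
      auto allocator = std::make_unique<StreamAllocator>();
      GpuDevice device(allocator.get());         // constructing the device needs the allocator
      device.Init();                             // the stream exists only after Init()
      allocator->SetStream(device.GetStream());  // so it is injected afterwards
    }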

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes May 18, 2021
@nouiz nouiz force-pushed the upstream-cuda_malloc_async_fixes branch from 378aaad to 2f856ab Compare May 18, 2021 13:10
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 19, 2021
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer May 19, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 19, 2021
@gbaned gbaned removed the ready to pull PR ready for merge process label May 21, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 21, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label May 21, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label May 26, 2021
@nouiz
Contributor Author
nouiz commented May 26, 2021

I updated this PR. The new configuration option is now experimental.
I re-added the use of that new configuration option, which got lost in my rebase.
I also updated the test to verify that the right allocator is used. If you have another idea of how to do this, let me know.

Contributor
@sanjoy sanjoy left a comment

Minor comments.

@@ -106,6 +110,10 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
CUmemoryPool pool_;
#endif // TF_CUDA_MALLOC_ASYNC_SUPPORTED

// Just a counter for the number of time this class is instanciated.
Contributor

s/instanciated/instantiated/ everywhere.

Also s/nb/number

Contributor Author

Done.

@@ -106,6 +110,10 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
CUmemoryPool pool_;
#endif // TF_CUDA_MALLOC_ASYNC_SUPPORTED

// Just a counter for the number of time this class is instanciated.
// Only useful for tests.
static int nb_instanciated_;
Contributor

Does this need to be an std::atomic?

Contributor Author

Implemented. Note that the current TF GpuDevice creates them serially, but making it atomic is more future-proof.
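
For reference, a test-only instantiation counter along those lines might look like the following minimal sketch (hypothetical class name, not the actual TF implementation):

    // Minimal sketch: a thread-safe, test-only counter of allocator instantiations.
    #include <atomic>

    class AsyncAllocatorSketch {
     public:
      AsyncAllocatorSketch() {
        number_instantiated_.fetch_add(1, std::memory_order_relaxed);
      }

      // Only useful for tests. std::atomic keeps the count correct even if
      // allocators are ever created from multiple threads.
      static int GetInstantiatedCountTestOnly() {
        return number_instantiated_.load(std::memory_order_relaxed);
      }

     private:
      static std::atomic<int> number_instantiated_;
    };

    std::atomic<int> AsyncAllocatorSketch::number_instantiated_{0};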

#endif
}

static int NbInstantiated() {
Contributor

Maybe call this GetInstantiatedCountTestOnly?

Contributor Author

Done.

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes May 27, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels May 28, 2021
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer May 28, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jun 8, 2021
@nouiz
Contributor Author
nouiz commented Jun 8, 2021

I fixed the new conflict.
I also found a bug in the changes that caused the first merge conflict. I added the fix here, as it made my tests fail with CUDA 11.3.

Member
@penpornk penpornk left a comment

Thank you very much!

EXPECT_EQ(devices.size(), 1);
Device* device = devices[0].get();
auto* device_info = device->tensorflow_gpu_device_info();
CHECK(device_info);
Member

Could you please use DCHECK instead? We don't use CHECK in new code anymore. Sorry we didn't notice this earlier!

Contributor Author

This is a test; I really want a crash even in non-debug mode.
So I pushed a change to use an EXPECT instead.
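
For illustration, the EXPECT-based check can be sketched like this (a hypothetical, self-contained GoogleTest example, not the actual gpu_device_test):

    // Sketch: EXPECT_*/ASSERT_* fire in all build modes, whereas DCHECK
    // compiles out of optimized builds, so the check stays active in
    // non-debug mode too.
    #include <gtest/gtest.h>

    namespace {

    struct FakeDeviceInfo { int gpu_id = 0; };

    // Stand-in for device->tensorflow_gpu_device_info().
    FakeDeviceInfo* LookupDeviceInfo() {
      static FakeDeviceInfo info;
      return &info;
    }

    TEST(GpuDeviceSketchTest, DeviceInfoIsPresent) {
      FakeDeviceInfo* device_info = LookupDeviceInfo();
      EXPECT_NE(device_info, nullptr);  // still checked in opt (non-debug) builds
    }

    }  // namespace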

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Jun 8, 2021
@pkanwar23 pkanwar23 added the ready to pull PR ready for merge process label Jun 8, 2021
@google-ml-butler google-ml-butler bot added the kokoro:force-run Tests on submitted change label Jun 9, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jun 9, 2021
@nouiz
Contributor Author
nouiz commented Jun 10, 2021

Just to be sure: I see many "Internal CI build failed" messages. Do they block this PR? What is their cause? The Windows error is that Bazel didn't find the strategy "sandboxed"; to my knowledge, that has been reverted.

@akuegel
Member
akuegel commented Jun 11, 2021

Just to be sure: I see many "Internal CI build failed" messages. Do they block this PR? What is their cause? The Windows error is that Bazel didn't find the strategy "sandboxed"; to my knowledge, that has been reverted.

The problem was an unused variable. @penpornk has adjusted your patch internally; that patch then needed internal approval, and now it is theoretically ready to land (if the tests pass).

@nouiz
Contributor Author
nouiz commented Jun 11, 2021

Thanks for the update.

@nouiz
Contributor Author
nouiz commented Jun 14, 2021

During some refactoring of cudaMallocAsync, a deadlock was introduced on the OOM path.
Here is a commit that fixes it: 4cab182

I wasn't sure if you wanted to include it in this PR or not. I can make another PR with the trivial fix if you want. Just tell me how you want to handle it.

@akuegel
Member
akuegel commented Jun 15, 2021

During some refactoring of cudaMallocAsync, a deadlock was introduced on the OOM path.
Here is a commit that fixes it: 4cab182

I wasn't sure if you wanted to include it in this PR or not. I can make another PR with the trivial fix if you want. Just tell me how you want to handle it.

I think landing that separately is fine. As far as I can tell, the problem is a missing BUILD dependency for the new header include in gpu_device_test. I added this as a review comment on the internal change; hopefully that is the only remaining thing needed to get it landed.

@nouiz
Contributor Author
nouiz commented Jun 16, 2021

Any update? I think the TF 2.6 cutoff is soon. It would be great to get this included so people can start to use it manually.
Is there anything I can do to help?

@penpornk
Member

@nouiz Apologies for the delay! A couple of tests failed the heap check (memory leaks). (You'll have to build them and run the binaries with pprof to reproduce. The problem is that these tests seem to have trouble building in my open-source environment right now -- so I'm not sure if you'll be able to build them.)

$ bazel build --config=cuda --config=opt  \
   //tensorflow/core/common_runtime/gpu:gpu_device_test
$ bazel build --config=cuda --config=opt \
   //tensorflow/core/common_runtime/gpu:gpu_device_test_gpu

Any idea where the leak could come from? I'm sorry I haven't had time to take a closer look at what the PR is doing yet. I'll try to do that today. Will try to get this into 2.6.

@nouiz
Contributor Author
nouiz commented Jun 16, 2021

I looked at the code, but I do not understand what could leak.
There is a different cast here: https://github.com/tensorflow/tensorflow/pull/49173/files#diff-d54c8000677cfd5b7bbc09963559b66949e9738b944210ee215dff19730ee296R88

I do not know pprof. Are there any options I must specify? I'll try to reproduce the error, in case I can help.

@penpornk
Member

@nouiz Thank you very much for offering to help! :)
I think I've fixed the leaks. (I added back the code that deletes gpu_bfc_allocator, but not the suballocator.)

      delete gpu_bfc_allocator;
      gpu_bfc_allocator = nullptr;

The merge is now stuck because of some (seemingly unrelated) infra issues instead. I'll keep you posted.

Just for reference, here is how to use pprof to check for memory leaks (paths are based on Ubuntu 18.04 -- may be different on different OSes):

# Install pprof and TCMalloc
$ sudo apt install libgoogle-perftools-dev google-perftools
$ export PPROF_PATH=/usr/bin/google-pprof

# Build the test
$ bazel build --config=cuda --config=opt //tensorflow/core/common_runtime/gpu:gpu_device_test

# Run the test with TCMalloc and pprof's heapcheck enabled.
$ HEAPCHECK=normal LD_PRELOAD=/usr/lib/libtcmalloc.so TF2_BEHAVIOR=1 \
    bazel-bin/tensorflow/core/common_runtime/gpu/gpu_device_test

@copybara-service copybara-service bot merged commit 8704f47 into tensorflow:master Jun 17, 2021
PR Queue automation moved this from Reviewer Requested Changes to Merged Jun 17, 2021
@nouiz nouiz deleted the upstream-cuda_malloc_async_fixes branch June 17, 2021 12:55
@nouiz
Contributor Author
nouiz commented Jun 17, 2021

It is now merged. Thanks for the help.

For the last pprof instruction:

    bazel-bin/tensorflow/core/common_runtime/gpu/gpu_device_test

I do not see a call to pprof. Shouldn't it be:

HEAPCHECK=normal LD_PRELOAD=/usr/lib/libtcmalloc.so TF2_BEHAVIOR=1
$PPROF_PATH bazel-bin/tensorflow/core/common_runtime/gpu/gpu_device_test

Or something like that?

@penpornk
Member

Thank you very much again for your patience!

I do not see a call to pprof. Shouldn't it be:

HEAPCHECK=normal LD_PRELOAD=/usr/lib/libtcmalloc.so TF2_BEHAVIOR=1
$PPROF_PATH bazel-bin/tensorflow/core/common_runtime/gpu/gpu_device_test

Or something like that?

Great question. The heap checker is part of TCMalloc, so we only need to link the binary against TCMalloc and turn on the heap checker with the HEAPCHECK environment variable. It uses pprof underneath; that's why we need to install pprof and set the path to it. :)

copybara-service bot pushed a commit that referenced this pull request Jun 21, 2021
Imported from GitHub PR #49173

The first commit fixes #48869. The second commit fixes another follow-up crash when using TF_GPU_ALLOCATOR=cuda_malloc_async.

The 2 fixes are:
- The Allocator API have...

PiperOrigin-RevId: 380643089
Change-Id: I06f04d8b2d8ed6b08b91f94123a5e9e8a1681793
nouiz added a commit to nouiz/tensorflow that referenced this pull request Jul 26, 2021
nouiz added a commit to nouiz/tensorflow that referenced this pull request Aug 18, 2021
Labels
cla: yes comp:core issues related to core part of tensorflow prtype:bugfix PR to fix a bug ready to pull PR ready for merge process size:XS CL Change Size: Extra Small
Projects
PR Queue
  
Merged
8 participants