[Crash fix] Fix cudaMallocAsync crashes. #49173
Conversation
Force-pushed from 91ad81e to 378aaad.
Can you please update the PR description to state what the bug was and how this PR fixes it?
```diff
@@ -85,7 +85,7 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   void SetStream(void* stream) override {
 #if TF_CUDA_MALLOC_ASYNC_SUPPORTED
-    cuda_stream_ = reinterpret_cast<CUstream>(stream);
+    cuda_stream_ = *(reinterpret_cast<CUstream*>(stream));
```
I did not notice this earlier, but a setter for an immutable property seems like a code smell to me. Have you tried passing in the stream via the constructor somehow?
(Does not need to be addressed in this PR.)
Mostly, the GpuDevice needs the gpu_allocator, but the gpu_allocator needs the stream created by the GpuDevice.
Fixing this would require a refactoring that breaks existing APIs, and I wasn't sure whether those APIs are part of the public API or not. As I wanted this to get into TF 2.5, I preferred to stay away from that refactoring.
Relevant code:
tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc
Lines 1414 to 1422 in aa08697
```cpp
std::unique_ptr<BaseGPUDevice> gpu_device = CreateGPUDevice(
    options, device_name, static_cast<Bytes>(bytes_limit), dev_locality,
    tf_device_id, GetShortDeviceDescription(platform_device_id, *desc),
    gpu_allocator, ProcessState::singleton()->GetCPUAllocator(numa_node));
LOG(INFO) << "Created device " << device_name << " with "
          << (bytes_limit >> 20) << " MB memory: "
          << " -> " << GetShortDeviceDescription(platform_device_id, *desc);
TF_RETURN_IF_ERROR(gpu_device->Init(options));
gpu_allocator->SetStream(gpu_device->GetStream());
```
(I also updated the PR description.)
(I amended the test commit to remove a commented-out leftover line.)
Force-pushed from 378aaad to 2f856ab.
I updated this PR. The new configuration option is now experimental.
Minor comments.
```diff
@@ -106,6 +110,10 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   CUmemoryPool pool_;
 #endif  // TF_CUDA_MALLOC_ASYNC_SUPPORTED

+  // Just a counter for the number of time this class is instanciated.
```
s/instanciated/instantiated/ everywhere.
Also s/nb/number
Done.
```diff
@@ -106,6 +110,10 @@ class GpuCudaMallocAsyncAllocator : public Allocator {
   CUmemoryPool pool_;
 #endif  // TF_CUDA_MALLOC_ASYNC_SUPPORTED

+  // Just a counter for the number of time this class is instanciated.
+  // Only useful for tests.
+  static int nb_instanciated_;
```
Does this need to be an std::atomic?
Implemented. Note that the current TF GpuDevice creates them serially, but having it atomic is more future-proof.
```cpp
#endif
  }

  static int NbInstantiated() {
```
Maybe call this GetInstantiatedCountTestOnly?
Done.
I fixed the new conflict.
Thank you very much!
```cpp
EXPECT_EQ(devices.size(), 1);
Device* device = devices[0].get();
auto* device_info = device->tensorflow_gpu_device_info();
CHECK(device_info);
```
Could you please use DCHECK instead? We don't use CHECK in new code anymore. Sorry we didn't notice this earlier!
This is a test, and I really want a crash even in non-debug mode, so I pushed a change to use an EXPECT instead.
Just to be sure: I see many "Internal CI build failed" messages. Do they block this PR? What is their cause? The Windows error is that Bazel didn't find the "sandboxed" strategy, which to my knowledge has been reverted.
The problem was an unused variable. @penpornk has adjusted your patch internally, then that patch needed approval internally; now it is theoretically ready to land (if the tests pass).
Thanks for the update.
In some of the refactoring of cudaMallocAsync, a deadlock was introduced on OOM. I wasn't sure if you wanted to include the fix in this PR or not. I can make another PR with the trivial fix if you want; just tell me how you want to handle it.
I think landing that separately is fine. As far as I can tell, the problem is a missing BUILD dependency for the new header include in the gpu_device_test. I added this as a review comment on the internal change; hopefully that is the only remaining thing needed to get it landed.
Any update? I think the TF 2.6 cutoff is soon. It would be great to get this included so people can start to use it manually.
@nouiz Apologies for the delay! A couple of tests failed the heapcheck (memory leaks). (You'll have to build them and run the binary with pprof to reproduce. The problem is these tests seem to have trouble building in my open-source environment right now, so I'm not sure if you'll be able to build them.)
Any idea where the leak could come from? I'm sorry I haven't had time to take a closer look at what the PR is doing yet. I'll try to do that today. Will try to get this into 2.6.
I looked at the code, but I do not understand what could leak. I do not know pprof; are there any options I must specify? I'll try to reproduce the error, just in case I can help.
@nouiz Thank you very much for offering to help! :)

```cpp
delete gpu_bfc_allocator;
gpu_bfc_allocator = nullptr;
```

The merge is now stuck because of some (seemingly unrelated) infra issues instead. I'll keep you posted. Just for reference, here is how to use
It is now merged. Thanks for the help. For pprof, the last instruction:

```shell
HEAPCHECK=normal LD_PRELOAD=/usr/lib/libtcmalloc.so TF2_BEHAVIOR=1
```
Thank you very much again for your patience!
Great question. The heap checker is part of TCMalloc, so we only need to link the binary against TCMalloc and turn on the heap checker with the environment variable
This reverts commit a4553f8.
The first commit fixes #48869. The second commit fixes another follow-up crash when using TF_GPU_ALLOCATOR=cuda_malloc_async.
The 2 fixes are:
@sanjoy @chsigg