Crash fix cudaMallocAsync usage of TF_CUDA_MALLOC_ASYNC_SUPPORTED_PREALLOC. #50962

nouiz · 2021-07-26T21:21:50Z

A recent refactoring caused crashes when TF_CUDA_MALLOC_ASYNC_SUPPORTED_PREALLOC is used.
This PR fixes them and adds a test to make sure it continues to work.

This PR is on top of #50961, as otherwise there would be code conflicts.

Now, GpuCudaMallocAsyncAllocator no longer creates its own compute stream. Instead, it uses a stream that is set after construction, when GpuCudaMallocAsyncAllocator::SetStream is called. It is therefore no longer possible to preallocate all the memory during object construction.
So this PR postpones the memory preallocation until the stream is available, in SetStream.

This PR also checks that the stream isn't already set when SetStream is called.
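For readers who want to see the shape of the change, here is a minimal sketch of the pattern described above: the constructor only records the requested size, and the preallocation runs the first time a stream is supplied. This is not the TensorFlow code. `FakeStream`, `DeferredPreallocAllocator`, and all member names are hypothetical, no CUDA calls are made, and throwing on a second `SetStream` call is an assumption chosen so the behavior is easy to observe; how the actual change reports that error is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for a GPU stream handle (not a TensorFlow or CUDA type).
struct FakeStream {
  int id;
};

// Illustrative mock allocator: preallocation is deferred until a stream exists.
class DeferredPreallocAllocator {
 public:
  // No stream is available yet, so only remember how much to preallocate.
  explicit DeferredPreallocAllocator(std::size_t prealloc_bytes)
      : prealloc_bytes_(prealloc_bytes) {}

  // Sets the stream exactly once, then performs the postponed preallocation.
  void SetStream(FakeStream* stream) {
    assert(stream != nullptr);
    if (stream_ != nullptr) {
      // Assumption: a second call is treated as a programming error.
      throw std::logic_error("SetStream called but the stream is already set");
    }
    stream_ = stream;
    Preallocate();
  }

  bool preallocated() const { return preallocated_; }
  std::size_t reserved_bytes() const { return reserved_bytes_; }

 private:
  // A real allocator would reserve device memory on stream_ here;
  // this mock only records that the reservation happened.
  void Preallocate() {
    reserved_bytes_ = prealloc_bytes_;
    preallocated_ = true;
  }

  std::size_t prealloc_bytes_;
  std::size_t reserved_bytes_ = 0;
  bool preallocated_ = false;
  FakeStream* stream_ = nullptr;
};
```

With this layout, constructing the object never touches a stream, and the first `SetStream` call is the single place where both the stream check and the preallocation happen.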

@sanjoy

sanjoy · 2021-07-27T06:09:21Z

Please make the PR description a bit more descriptive (what was the bug?).

CC @hanbinyoon

nouiz · 2021-07-27T14:46:15Z

I updated the description.

tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc

gbaned · 2021-08-09T15:35:15Z

@nouiz Can you please check @sanjoy's comments and keep us posted? Thanks!

nouiz · 2021-08-30T16:58:56Z

I rebased this PR since the PR it depends on has been merged.
I also addressed the review comment, so it should be ready to merge now.

nouiz · 2021-09-07T13:24:55Z

Any update?

sanjoy

Only a minor comment.

tensorflow/core/framework/allocator.h

nouiz · 2021-09-10T20:23:00Z

The CI seems to have failed:

feedback/copybara — Google internal checks FAILED for runs with create time 2021-09-09T07:29:11.802148007Z.

Is this related to this PR?

tensorflow-bot bot added the size:M CL Change Size: Medium label Jul 26, 2021

google-cla bot added the cla: yes label Jul 26, 2021

gbaned self-assigned this Jul 27, 2021

gbaned added the comp:core issues related to core part of tensorflow label Jul 27, 2021

gbaned added this to Assigned Reviewer in PR Queue via automation Jul 27, 2021

gbaned requested a review from sanjoy July 27, 2021 04:16

sanjoy approved these changes Jul 27, 2021

google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jul 27, 2021

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jul 27, 2021

kokoro-team removed the kokoro:force-run Tests on submitted change label Jul 27, 2021

nouiz force-pushed the upstream-cudaMallocAsync-Preallocate branch from 1cb58ca to 17d3b28 Compare July 27, 2021 21:31

google-ml-butler bot removed the ready to pull PR ready for merge process label Jul 27, 2021

gbaned requested review from sanjoy and removed request for sanjoy July 28, 2021 13:46

google-ml-butler bot added the awaiting review Pull request awaiting review label Aug 2, 2021

sanjoy suggested changes Aug 4, 2021

tensorflow/core/common_runtime/gpu/gpu_cudamallocasync_allocator.cc (outdated)

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Aug 4, 2021

tensorflowbutler removed the awaiting review Pull request awaiting review label Aug 5, 2021

gbaned added the stat:awaiting response Status - Awaiting response from author label Aug 9, 2021

nouiz force-pushed the upstream-cudaMallocAsync-Preallocate branch from 17d3b28 to e41fc39 Compare August 19, 2021 13:45

tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 21, 2021

nouiz added 3 commits August 30, 2021 08:23

Crash fix cudaMallocAsync usage of TF_CUDA_MALLOC_ASYNC_SUPPORTED_PRE…

bfc518f

…ALLOC.

Add a check.

c5019e2

[NFC] change function name

f95e862

nouiz force-pushed the upstream-cudaMallocAsync-Preallocate branch from e41fc39 to f95e862 Compare August 30, 2021 16:58

gbaned requested a review from sanjoy September 1, 2021 14:44

google-ml-butler bot added the awaiting review Pull request awaiting review label Sep 1, 2021

Move a comment to the where it belong.

3d005f3

sanjoy suggested changes Sep 8, 2021

tensorflow/core/framework/allocator.h

gbaned removed the awaiting review Pull request awaiting review label Sep 8, 2021

Update a comment.

cda095c

sanjoy approved these changes Sep 8, 2021

google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Sep 8, 2021

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Sep 8, 2021

kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 8, 2021

gbaned added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Sep 9, 2021

copybara-service bot merged commit 5a44e35 into tensorflow:master Sep 11, 2021

PR Queue automation moved this from Approved by Reviewer to Merged Sep 11, 2021

google-ml-butler bot removed the ready to pull PR ready for merge process label Sep 11, 2021

nouiz deleted the upstream-cudaMallocAsync-Preallocate branch January 26, 2022 14:47