[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash fix cudaMallocAsync usage of TF_CUDA_MALLOC_ASYNC_SUPPORTED_PREALLOC. #50962

Merged

Conversation

nouiz
Copy link
Contributor
@nouiz nouiz commented Jul 26, 2021

Some of the refactoring caused crash when using TF_CUDA_MALLOC_ASYNC_SUPPORTED_PREALLOC.
This PR fix them and add a test to be sure it continue to work.

This PR is on top of #50961 as otherwise there would be code conflict.

Now, GpuCudaMallocAsyncAllocator doesn't create its own compute stream. But it use one stream set after construction when GpuCudaMallocAsyncAllocator::SetStream is called. So it is now impossible to preallocate all memory during the object construction.
So this PR postpone the memory preallocation to when the stream is available, in SetStream.

This PR also check that the stream isn't set when SetStream is called.

@sanjoy

@tensorflow-bot tensorflow-bot bot added the size:M CL Change Size: Medium label Jul 26, 2021
@google-cla google-cla bot added the cla: yes label Jul 26, 2021
@gbaned gbaned self-assigned this Jul 27, 2021
@gbaned gbaned added the comp:core issues related to core part of tensorflow label Jul 27, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Jul 27, 2021
@gbaned gbaned requested a review from sanjoy July 27, 2021 04:16
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jul 27, 2021
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Jul 27, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jul 27, 2021
@sanjoy
Copy link
Contributor
sanjoy commented Jul 27, 2021

Please make the PR description a bit more descriptive (what was the bug?).

CC @hanbinyoon

@nouiz
Copy link
Contributor Author
nouiz commented Jul 27, 2021

I updated the description.

@nouiz nouiz force-pushed the upstream-cudaMallocAsync-Preallocate branch from 1cb58ca to 17d3b28 Compare July 27, 2021 21:31
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Jul 27, 2021
@gbaned gbaned requested review from sanjoy and removed request for sanjoy July 28, 2021 13:46
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Aug 2, 2021
PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Aug 4, 2021
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Aug 5, 2021
@gbaned
Copy link
Contributor
gbaned commented Aug 9, 2021

@nouiz Can you please check @sanjoy's comments and keep us posted ? Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Aug 9, 2021
@nouiz nouiz force-pushed the upstream-cudaMallocAsync-Preallocate branch from 17d3b28 to e41fc39 Compare August 19, 2021 13:45
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 21, 2021
@nouiz nouiz force-pushed the upstream-cudaMallocAsync-Preallocate branch from e41fc39 to f95e862 Compare August 30, 2021 16:58
@nouiz
Copy link
Contributor Author
nouiz commented Aug 30, 2021

I rebased this PR since the dependent PR is merged.
I also did the review comment. So it should be ready to merge now.

@gbaned gbaned requested a review from sanjoy September 1, 2021 14:44
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label Sep 1, 2021
@nouiz
Copy link
Contributor Author
nouiz commented Sep 7, 2021

Any update?

Copy link
Contributor
@sanjoy sanjoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a minor comment.

tensorflow/core/framework/allocator.h Show resolved Hide resolved
@gbaned gbaned removed the awaiting review Pull request awaiting review label Sep 8, 2021
@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Sep 8, 2021
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Sep 8, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 8, 2021
@gbaned gbaned added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Sep 9, 2021
@nouiz
Copy link
Contributor Author
nouiz commented Sep 10, 2021

The CI seems to have failed:

feedback/copybara — Google internal checks FAILED for runs with create time 2021-09-09T07:29:11.802148007Z.

Is this related to this PR?

@copybara-service copybara-service bot merged commit 5a44e35 into tensorflow:master Sep 11, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Sep 11, 2021
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Sep 11, 2021
@nouiz nouiz deleted the upstream-cudaMallocAsync-Preallocate branch January 26, 2022 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:core issues related to core part of tensorflow size:M CL Change Size: Medium
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

5 participants