CopyTensor::ViaDMA function: allocator type sometimes does not match the actual underlying memory type of the input #60856

Open
chengchen666 opened this issue Jun 14, 2023 · 0 comments
Labels: comp:runtime, stat:awaiting tensorflower, TF 2.12, type:bug

Comments

chengchen666 commented Jun 14, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

No

Source

source

Tensorflow Version

2.12.0rc0

Custom Code

Yes

OS Platform and Distribution

CentOS Linux 7

Mobile device

No response

Python version

3.7.5

Bazel version

bazel 3.7.2

GCC/Compiler version

gcc-9

CUDA/cuDNN version

cuda 11, cudnn 8

GPU model and memory

Tesla V100S

Current Behaviour?

In CopyTensor::ViaDMA, alloc_attr decides the direction of the memory copy. However, alloc_attr does not always match the underlying memory type of the Tensor's data. In my case, src_alloc_attr.on_host() is true but input->GetMemoryType() equals kDevice, so the copy direction chosen by this function is CPU->GPU when it should actually be GPU->GPU.
I think the reason this bug has not surfaced is that the CUDA driver API, e.g. cuMemcpyHtoD(), does not strictly enforce the stated direction (H to D or otherwise); it only looks at the pointer attributes. If the src pointer is on the device and the dst pointer is also on the device, the driver still performs a D-to-D copy even when cuMemcpyHtoD() is called. This behavior can mask many bugs of this kind.
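
To illustrate that driver behavior, here is a standalone sketch against the CUDA driver API (not TensorFlow code; the allocation size and variable names are made up). On UVA-capable systems, calling cuMemcpyHtoD() with a source pointer that actually lives on the device still completes as a device-to-device copy, and cuPointerGetAttribute() can be used to confirm where the pointer really resides:

#include <cstdio>
#include <cuda.h>

int main() {
  // Minimal driver-API setup.
  cuInit(0);
  CUdevice dev;
  CUcontext ctx;
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  const size_t kBytes = 1024;  // arbitrary size for this sketch
  CUdeviceptr src, dst;
  cuMemAlloc(&src, kBytes);
  cuMemAlloc(&dst, kBytes);

  // "Host-to-device" copy whose src is really a device pointer: with unified
  // virtual addressing the driver resolves the pointer's actual location and
  // performs a D2D copy, so no error surfaces.
  CUresult r = cuMemcpyHtoD(dst, reinterpret_cast<const void*>(src), kBytes);
  std::printf("cuMemcpyHtoD with device src returned %d\n", static_cast<int>(r));

  // Confirm where src really lives (CU_MEMORYTYPE_DEVICE == 2).
  unsigned int mem_type = 0;
  cuPointerGetAttribute(&mem_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, src);
  std::printf("src memory type: %u\n", mem_type);

  cuMemFree(src);
  cuMemFree(dst);
  cuCtxDestroy(ctx);
  return 0;
}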

I haven't figured out where the on_host attribute is set. From my understanding so far, the same allocator object is reused across different tensors, but the on_host attribute is one-way: once it has been set, it cannot be unset later. Could this cause problems? Also, why don't we simply use input->GetMemoryType() to decide the memory copy direction instead of the on_host attribute of alloc_attr?
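
As a rough sketch of that suggestion (not the existing ViaDMA implementation; the CopyDirection enum and the helper names below are hypothetical), the copy direction could be derived from the tensors' actual memory types like this:

#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/framework/tensor.h"

namespace tensorflow {

// Hypothetical helper: classify the copy purely from the tensors' real
// underlying memory types instead of alloc_attr.on_host().
// Note: AllocatorMemoryType::kUnknown is not handled in this sketch.
enum class CopyDirection { kHostToHost, kHostToDevice, kDeviceToHost, kDeviceToDevice };

inline bool IsHostMemory(AllocatorMemoryType t) {
  return t == AllocatorMemoryType::kHostPageable ||
         t == AllocatorMemoryType::kHostPinned;
}

inline CopyDirection DirectionFromMemoryType(const Tensor* input, const Tensor* output) {
  const bool src_on_host = IsHostMemory(input->GetMemoryType());
  const bool dst_on_host = IsHostMemory(output->GetMemoryType());
  if (src_on_host && dst_on_host) return CopyDirection::kHostToHost;
  if (src_on_host) return CopyDirection::kHostToDevice;
  if (dst_on_host) return CopyDirection::kDeviceToHost;
  return CopyDirection::kDeviceToDevice;
}

}  // namespace tensorflow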

I hit this issue while running a Horovod unit test. I added some log messages at
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/copy_tensor.cc#L219
such as:

if (!src_alloc_attr.on_host() &&
    (input->GetMemoryType() == AllocatorMemoryType::kHostPageable ||
     input->GetMemoryType() == AllocatorMemoryType::kHostPinned)) {
  std::cout << "!!!!!!! src alloc not on host, but input mem type is on host" << std::endl;
}
if (src_alloc_attr.on_host() &&
    input->GetMemoryType() == AllocatorMemoryType::kDevice) {
  std::cout << "!!!!!!! src alloc on host, but input mem is on device" << std::endl;
}

if (!dst_alloc_attr.on_host() &&
    (output->GetMemoryType() == AllocatorMemoryType::kHostPageable ||
     output->GetMemoryType() == AllocatorMemoryType::kHostPinned)) {
  std::cout << "!!!!!!! dst alloc not on host, but output mem type is on host" << std::endl;
}
if (dst_alloc_attr.on_host() &&
    output->GetMemoryType() == AllocatorMemoryType::kDevice) {
  std::cout << "!!!!!!! dst alloc on host, but output mem is on device" << std::endl;
}

In my case, I ran the Horovod alltoall op unit test to reproduce this issue, but it may show up in other cases as well.

Standalone code to reproduce the issue

Running the following Horovod unit test reproduces the issue:
https://github.com/horovod/horovod/blob/master/test/parallel/test_tensorflow.py

horovodrun --mpi -np 2 pytest -s -v test_tensorflow.py::TensorFlowTests::test_horovod_alltoall_gpu

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:bug Bug label Jun 14, 2023
@SuryanarayanaY SuryanarayanaY added TF 2.12 For issues related to Tensorflow 2.12 comp:runtime c++ runtime, performance issues (cpu) labels Jun 14, 2023
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 19, 2023