CopyTensor::ViaDMA function, allocator type sometimes not match actual input underlying memory type #60856
Labels
comp:runtime
c++ runtime, performance issues (cpu)
stat:awaiting tensorflower
Status - Awaiting response from tensorflower
TF 2.12
For issues related to Tensorflow 2.12
type:bug
Bug
Issue Type
Bug
Have you reproduced the bug with TF nightly?
No
Source
source
Tensorflow Version
2.12.0rc0
Custom Code
Yes
OS Platform and Distribution
CentOS Linux 7
Mobile device
No response
Python version
3.7.5
Bazel version
bazel 3.7.2
GCC/Compiler version
gcc-9
CUDA/cuDNN version
cuda 11, cudnn 8
GPU model and memory
Tesla V100S
Current Behaviour?
In CopyTensor::ViaDMA, alloc_attr decides the direction of the memory copy. However, alloc_attr sometimes does not match the memory type of the tensor's underlying buffer. In my case, src_alloc_attr.on_host() is true, but input->GetMemoryType() equals kDevice. As a result, this function treats the copy as CPU -> GPU, when the direction should actually be GPU -> GPU.

I think this bug goes unnoticed because CUDA driver APIs such as cuMemcpyHtoD() do not care whether the requested direction is H to D or something else; they only look at the pointer attributes. If the src pointer and the dst pointer are both on the device, the CUDA driver still performs a D to D copy even when we call cuMemcpyHtoD(). This behavior could mask many bugs.

I haven't figured out where the on_host attribute is set. From my understanding so far, the same allocator object is reused for different tensors, but the on_host attribute is one-way: once it has been set, it cannot be unset later. Could this cause problems? Also, why don't we just use input->GetMemoryType() to decide the memory copy direction, instead of the on_host attribute of alloc_attr?
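The mismatch described above can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions: MemoryType, CopyDirection, and the three functions below are hypothetical stand-ins written for this example, not the real TensorFlow types or the actual logic of CopyTensor::ViaDMA.

```cpp
// Hypothetical, simplified stand-ins; not the real TensorFlow definitions.
enum class MemoryType { kHost, kDevice };
enum class CopyDirection {
  kHostToHost,
  kHostToDevice,
  kDeviceToHost,
  kDeviceToDevice
};

// Maps a pair of "is on host" flags to a copy direction.
CopyDirection DirectionFromHostFlags(bool src_on_host, bool dst_on_host) {
  if (src_on_host) {
    return dst_on_host ? CopyDirection::kHostToHost
                       : CopyDirection::kHostToDevice;
  }
  return dst_on_host ? CopyDirection::kDeviceToHost
                     : CopyDirection::kDeviceToDevice;
}

// Direction implied by the allocator attributes (the on_host flags).
CopyDirection DirectionFromAllocAttrs(bool src_attr_on_host,
                                      bool dst_attr_on_host) {
  return DirectionFromHostFlags(src_attr_on_host, dst_attr_on_host);
}

// Direction implied by where the buffers actually live.
CopyDirection DirectionFromMemoryTypes(MemoryType src, MemoryType dst) {
  return DirectionFromHostFlags(src == MemoryType::kHost,
                                dst == MemoryType::kHost);
}
```

With the values reported in this issue (source attribute says host, both buffers are on the device), DirectionFromAllocAttrs(true, false) yields kHostToDevice while DirectionFromMemoryTypes(MemoryType::kDevice, MemoryType::kDevice) yields kDeviceToDevice, which is the disagreement in question.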
I hit this issue while running the Horovod unit tests. To observe it, add some log messages near this line:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/copy_tensor.cc#L219
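A minimal sketch of the kind of check such a log message could perform, assuming simplified stand-in types. BufferLocation, CopyEndpoint, AttrMatchesBuffer, and DescribeEndpoint are hypothetical names introduced for this example; in real code the two inputs would come from the allocator attribute's on_host() and the tensor's GetMemoryType().

```cpp
#include <sstream>
#include <string>

// Hypothetical, simplified view of where a buffer actually lives.
enum class BufferLocation { kHost, kDevice };

// Hypothetical pairing of the attribute's claim with the observed location.
struct CopyEndpoint {
  bool attr_on_host;      // what the allocator attribute claims
  BufferLocation actual;  // where the buffer was observed to live
};

// True when the attribute agrees with the observed buffer location.
bool AttrMatchesBuffer(const CopyEndpoint& e) {
  return e.attr_on_host == (e.actual == BufferLocation::kHost);
}

// Builds a one-line description suitable for a log statement and
// flags any disagreement between attribute and buffer location.
std::string DescribeEndpoint(const std::string& name, const CopyEndpoint& e) {
  std::ostringstream out;
  out << name << ": on_host=" << (e.attr_on_host ? "true" : "false")
      << ", memory=" << (e.actual == BufferLocation::kHost ? "host" : "device");
  if (!AttrMatchesBuffer(e)) {
    out << " [MISMATCH]";
  }
  return out.str();
}
```

The returned string could then be passed to whatever logging macro is in use at that point in the runtime, once for the source and once for the destination.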
Specifically, I ran the Horovod alltoall op unit test to reproduce it, but the issue may also appear in other cases.
Standalone code to reproduce the issue
Relevant log output
No response