
tf.data.Dataset prefetch not fetching data asynchronously #61084

Open
zackwohl opened this issue Jun 26, 2023 · 4 comments

Labels: comp:data (tf.data related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.11 (Issues related to TF 2.11), type:bug (Bug), type:performance (Performance Issue)

Comments

zackwohl commented Jun 26, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

No

Source

source

Tensorflow Version

2.11

Custom Code

Yes

OS Platform and Distribution

Debian/Linux 11

Mobile device

No response

Python version

3.7

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

After implementing a data pipeline that uses tf.data.Dataset to pull image data from Google Cloud Storage, the TensorBoard profiler shows the GPU compute and CPU prefetch running synchronously (one after the other) rather than overlapping. I used tf.data.AUTOTUNE to determine the appropriate prefetch buffer size. Monitoring GPU usage while the model is running confirms this: the GPU sits at 0% utilization versus actively computing at roughly a 2:1 ratio, which matches the profiler trace. CPU usage does not appear to max out while this happens.

I expected the prefetch to occur concurrently with GPU processing, as described in the tf.data.Dataset documentation and tutorials.

[attached screenshots: ch, cp, gp — TensorBoard profiler traces]
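For anyone trying to reproduce the symptom in isolation, here is a minimal sketch (not from the original report; the 50 ms delays and element count are arbitrary) that times a slow producer against a slow consumer. When prefetch overlaps correctly, the second run should take roughly max(producer, consumer) time rather than their sum; if both runs take about the same time, the pipeline is executing synchronously as described above.

import time
import tensorflow as tf

def slow_produce(x):
    # Simulate ~50 ms of CPU-side I/O/decoding per element.
    return tf.py_function(lambda v: (time.sleep(0.05), v)[1], [x], tf.int64)

def consume(ds):
    start = time.time()
    for _ in ds:
        time.sleep(0.05)  # stand-in for the per-step GPU work
    return time.time() - start

base = tf.data.Dataset.range(50).map(slow_produce)
print("no prefetch :", consume(base))              # ~ producer + consumer time
print("prefetch(2) :", consume(base.prefetch(2)))  # ~ max(producer, consumer) if overlapping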

Standalone code to reproduce the issue

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ['TF_GPU_ALLOCATOR'] = "cuda_malloc_async"
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

def get_label(file_path):
    parts = tf.strings.split(file_path, os.path.sep)
    one_hot = parts[-2] == class_names
    return tf.argmax(one_hot)

def decode_img(img):
    img = tf.io.decode_image(img, channels=3, expand_animations = False)
    img = tf.image.resize(img, [244, 244])
    img = tf.cast(img, tf.float32)
    return img

def process_path(file_path):
    label = get_label(file_path)
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label

def configure_for_performance(ds):
    ds = ds.batch(128)
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return ds

files = tf.data.Dataset.list_files((data_dir + '/*/*.png'), shuffle=False)
files = files.shuffle(image_count, reshuffle_each_iteration=False)

val_size = int(image_count * 0.2)

train_files = files.skip(val_size)
val_files = files.take(val_size)

train_ds = train_files.interleave(lambda x: tf.data.Dataset.from_tensor_slices([x]), cycle_length=4, num_parallel_calls=tf.data.AUTOTUNE)
train_ds = train_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)

val_ds = val_files.interleave(lambda x: tf.data.Dataset.from_tensor_slices([x]), cycle_length=4, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)

train_ds = configure_for_performance(train_ds)
val_ds = configure_for_performance(val_ds)
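For context, a hedged sketch of how this pipeline might be consumed while capturing a profiler trace; the model architecture and the logs directory are placeholders, not part of the original report:

# Hypothetical consumer model; any Keras model would do here.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(244, 244, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(len(class_names)),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Profile batches 2-8 so the trace viewer shows whether prefetch overlaps compute.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(2, 8))
model.fit(train_ds, validation_data=val_ds, epochs=1, callbacks=[tensorboard_cb])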

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:bug Bug label Jun 26, 2023
@SuryanarayanaY SuryanarayanaY added TF 2.11 Issues related to TF 2.11 comp:data tf.data related issues type:performance Performance Issue labels Jun 27, 2023
SuryanarayanaY (Collaborator) commented

Hi @zackwohl ,

Thanks for reaching out. Could you submit a Colab gist replicating the reported behaviour with an image dataset?

Also, can you confirm the behaviour with buffer_size=1 or 2 instead of tf.data.AUTOTUNE, just to cross-check?

Thanks!
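For reference, the suggested check is just the reporter's configure_for_performance with a fixed buffer in place of AUTOTUNE:

def configure_for_performance(ds):
    ds = ds.batch(128)
    ds = ds.prefetch(buffer_size=2)  # fixed buffer instead of tf.data.AUTOTUNE
    return ds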

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label Jun 27, 2023
zackwohl (Author) commented Jun 28, 2023

Hi @SuryanarayanaY, I tried running this with buffer_size=2, and it continued to run synchronously. I've attached images of the TensorBoard profiler trace viewer.

[attached screenshots: gpu_2, pf_2 — TensorBoard trace viewer captures]

How would I submit a Colab gist, and what would you need in terms of data? I currently have my code in a Jupyter notebook.

Thanks

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jun 28, 2023
zackwohl (Author) commented

Hi @SuryanarayanaY, just wanted to follow up on next steps here.

zackwohl (Author) commented Jul 5, 2023

Hi @SuryanarayanaY, I'm still awaiting a response.

@SuryanarayanaY SuryanarayanaY added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 31, 2023