

tf.debugging.enable_check_numerics maximum recursion depth #69869

Closed
johnlarkin1 opened this issue Jun 17, 2024 · 4 comments
Assignees
Labels
stale · stat:awaiting response (Status - Awaiting response from author) · TF 2.16 · type:bug

Comments

johnlarkin1 commented Jun 17, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

v2.16.1-0-g5bc9d26649c 2.16.1

Custom code

Yes

OS platform and distribution

Linux Ubuntu 22.04.3 LTS

Mobile device

No response

Python version

Python 3.11.0rc1

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am trying to debug an incredibly annoying NaN/Inf issue in my TensorFlow model. I have tried the TensorBoard Debugger V2, but when I specify -1 for the max_buffer_size it generates around 80 GB of logs and the dashboard is often unable to render them, which is also unfortunate.

I'm trying to use tf.debugging.enable_check_numerics() to find the first place a NaN/Inf enters one of my tensors and starts to corrupt it.
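For context, what enable_check_numerics automates (flagging the first op whose output tensor goes bad) can be approximated by hand for an individual array. A minimal NumPy sketch of that idea, with a hypothetical tensor name for illustration:

```python
import numpy as np

def first_bad_value(name, arr):
    """Return (name, index, value) of the first NaN/Inf in `arr`,
    or None if the array is clean. This mimics, in plain NumPy, the
    per-tensor check that tf.debugging.enable_check_numerics performs
    automatically for every op in the graph."""
    flat = np.ravel(arr)
    bad = ~np.isfinite(flat)       # True where NaN or +/-Inf
    if bad.any():
        idx = int(np.argmax(bad))  # position of the first offender
        return (name, idx, flat[idx])
    return None

# Example: a tensor that goes bad partway through
activations = np.array([0.5, 1.2, np.nan, 3.0])
print(first_bad_value("dense_1/output", activations))
```

The real check_numerics path additionally attaches the op's stack trace, which is what makes it useful for locating the originating layer.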

Standalone code to reproduce the issue

Here is the main entry point for the script I'm executing:


import os
from alphabet import ALPHABET_SIZE
from callbacks import (
    ExtendedModelCheckpoint,
    ModelCheckpointWithPeriod,
    PrintModelParametersCallback,
)
from loader import HandwritingDataLoader
from model.handwriting_models import (
    DeepHandwritingSynthesisModel,
)

import tensorflow as tf
import numpy as np
import datetime

from constants import (
    BATCH_SIZE,
    GRADIENT_CLIP_VALUE,
    LEARNING_RATE,
    NUM_BIVARIATE_GAUSSIAN_MIXTURE_COMPONENTS,
    NUM_ATTENTION_GAUSSIAN_COMPONENTS,
    NUM_EPOCH,
)

tf.keras.mixed_precision.set_global_policy('float32')
np.set_printoptions(threshold=10)

# Path logic for saved files
curr_directory = os.path.dirname(os.path.realpath(__file__))
model_name = "handwriting_synthesis"
model_save_dir = f"{curr_directory}/saved/models/{model_name}/"
start_day = datetime.datetime.now().strftime("%Y%m%d")
log_dir = f"{curr_directory}/saved/logs/{model_name}/profile/{start_day}"
debugging_dir = f"{curr_directory}/saved/logs/{model_name}/debug/{start_day}"
os.makedirs(model_save_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)
os.makedirs(debugging_dir, exist_ok=True)
model_save_path = os.path.join(model_save_dir, "best_model.keras")
checkpoint_model_filepath = os.path.join(
    model_save_dir, "model_{epoch:02d}_{val_loss:.2f}.keras"
)
model_save_path_final = os.path.join(model_save_dir, "best_model_final.keras")
epochs_info_path = os.path.join(model_save_dir, "epochs_info.json")

# Hyper parameters
num_mixture_components = NUM_BIVARIATE_GAUSSIAN_MIXTURE_COMPONENTS
num_attention_gaussians = NUM_ATTENTION_GAUSSIAN_COMPONENTS

# Training parameters
desired_epochs = 10_000
model_chkpt_period = 200


# Get the data
data_loader = HandwritingDataLoader()
data_loader.prepare_data()
combined_train_strokes, combined_train_lengths = (
    data_loader.combined_train_strokes,
    data_loader.combined_train_stroke_lengths,
)
combined_train_trans, combined_trans_lengths = (
    data_loader.combined_train_transcriptions,
    data_loader.combined_train_transcription_lengths,
)

if __name__ == "__main__":
    # Preparing the input and target data for training
    x_train = combined_train_strokes
    x_train_len = combined_train_lengths
    y_train = np.zeros_like(x_train)
    y_train[:, :-1, :] = x_train[:, 1:, :]
    y_train_len = x_train_len - 1

    # Preparing the transcription information
    char_seq = combined_train_trans
    char_seq_len = combined_trans_lengths

    best_loss = float("inf")
    batch_size = BATCH_SIZE
    print(f"Debugging info: {debugging_dir}")
    #tf.debugging.set_log_device_placement(True)
    tf.config.set_soft_device_placement(True)
    tf.debugging.enable_check_numerics()
    tf.debugging.experimental.enable_dump_debug_info(
        debugging_dir, tensor_debug_mode="FULL_HEALTH", circular_buffer_size=10_000_000
    )

    dataset = tf.data.Dataset.from_tensor_slices(
        (
            {
                "input_strokes": x_train,
                "input_stroke_lens": x_train_len,
                "input_chars": char_seq,
                "input_char_lens": char_seq_len,
            },
            y_train,
        )
    ).batch(BATCH_SIZE)

    stroke_model = DeepHandwritingSynthesisModel(
        units=400,
        num_layers=3,
        num_mixture_components=num_mixture_components,
        num_chars=ALPHABET_SIZE,
        num_attention_gaussians=num_attention_gaussians,
        gradient_clip_value=GRADIENT_CLIP_VALUE,
    )
    learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=LEARNING_RATE,
        decay_steps=50_000,
        decay_rate=0.96,
        staircase=True,
    )
    stroke_model.compile(
        optimizer=tf.keras.optimizers.RMSprop(
            learning_rate=learning_rate_schedule,
            global_clipnorm=GRADIENT_CLIP_VALUE,
        ),
    )

    callbacks = [
        tf.keras.callbacks.TensorBoard(
            log_dir=log_dir, histogram_freq=1, profile_batch="500,520"
        ),
        ModelCheckpointWithPeriod(model_name, period=200),
        ExtendedModelCheckpoint(model_name),
        PrintModelParametersCallback(),
    ]

    history = stroke_model.fit(dataset, epochs=desired_epochs, callbacks=callbacks)
    stroke_model.save(model_save_path)
    print("Training completed.")
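The target preparation in the script above is a standard next-step (teacher-forcing) shift: y_train[:, :-1, :] = x_train[:, 1:, :], leaving the final target step as zeros. A toy NumPy check of that shift (toy shapes, not the real stroke data):

```python
import numpy as np

# Toy batch: 2 sequences, 4 timesteps, 3 features per step.
x_train = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# Next-step target: y[t] = x[t + 1]; last step stays zero-padded.
y_train = np.zeros_like(x_train)
y_train[:, :-1, :] = x_train[:, 1:, :]

# Each target step now equals the following input step.
assert np.array_equal(y_train[:, 0, :], x_train[:, 1, :])
assert np.all(y_train[:, -1, :] == 0)
```

This is why the script also computes y_train_len = x_train_len - 1: the final padded step carries no prediction target.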

Relevant log output

Debugging info: /root/code/src/saved/logs/handwriting_synthesis/debug/20240617
2024-06-17 12:56:20.745453: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Traceback (most recent call last):
  File "/root/code/src/train_handwriting_synthesis.py", line 115, in <module>
    stroke_model.compile(
  File "/root/code/venv/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/root/code/venv/lib/python3.11/site-packages/tensorflow/dtensor/python/api.py", line 64, in call_with_layout
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 1612, in _array_str_implementation
    return array2string(a, max_line_width, precision, suppress_small, ' ', "")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 736, in array2string
    return _array2string(a, options, separator, prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 513, in wrapper
    return f(self, *args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 539, in _array2string
    format_function = _get_format_function(data, **options)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 472, in _get_format_function
    return formatdict['float']()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 411, in <lambda>
    'float': lambda: FloatingFormat(
                     ^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded while calling a Python object
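Note that the traceback bottoms out in NumPy's array-printing code (arrayprint.py), not in the model itself. One hypothetical stopgap, not verified against this repro, is to raise CPython's recursion limit so the dump-debug machinery can finish printing while the underlying cause is investigated:

```python
import sys

# Hypothetical workaround sketch (untested against this specific issue):
# the RecursionError fires inside NumPy's array formatting, so raising
# Python's recursion limit from the default can postpone the crash.
print(sys.getrecursionlimit())  # CPython's default limit is 1000
sys.setrecursionlimit(10_000)
```

This only buys headroom; if the recursion is genuinely unbounded it will still crash, just later.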
@Venkat6871 (Contributor)

Hi @johnlarkin1,

  • I reproduced the shared code but am facing a different error. Could you please share a Colab gist with all the dependencies so we can analyze it further?

Thank you!

@Venkat6871 Venkat6871 added the stat:awaiting response Status - Awaiting response from author label Jun 20, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 28, 2024
github-actions bot commented Jul 6, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

@github-actions github-actions bot closed this as completed Jul 6, 2024