

tf.debugging.enable_check_numerics maximum recursion depth #69869

Closed
johnlarkin1 opened this issue Jun 17, 2024 · 4 comments
Assignees
Labels
stale · stat:awaiting response (Status - Awaiting response from author) · TF 2.16 · type:bug

Comments

johnlarkin1 commented Jun 17, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

v2.16.1-0-g5bc9d26649c 2.16.1

Custom code

Yes

OS platform and distribution

Linux Ubuntu 22.04.3 LTS

Mobile device

No response

Python version

Python 3.11.0rc1

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am trying to debug an incredibly annoying NaN/Inf issue in my TensorFlow model. I have tried the TensorBoard Debugger V2, but when I specify -1 for the max_buffer_size it generates around 80 GB of logs and the dashboard is often unable to render them, which is also unfortunate.

I'm trying to use tf.debugging.enable_check_numerics() to find the first place a NaN/Inf enters one of my tensors and starts to corrupt it.
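For context, what enable_check_numerics automates (flagging the first op whose output tensor goes bad) can be approximated by hand for an individual array. A minimal NumPy sketch of that idea, with a hypothetical tensor name for illustration:

```python
import numpy as np

def first_bad_value(name, arr):
    """Return (name, index, value) of the first NaN/Inf in `arr`,
    or None if the array is clean. This mimics, in plain NumPy, the
    per-tensor check that tf.debugging.enable_check_numerics performs
    automatically for every op in the graph."""
    flat = np.ravel(arr)
    bad = ~np.isfinite(flat)       # True where NaN or +/-Inf
    if bad.any():
        idx = int(np.argmax(bad))  # position of the first offender
        return (name, idx, flat[idx])
    return None

# Example: a tensor that goes bad partway through
activations = np.array([0.5, 1.2, np.nan, 3.0])
print(first_bad_value("dense_1/output", activations))
```

The real check_numerics path additionally attaches the op's stack trace, which is what makes it useful for locating the originating layer.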

Standalone code to reproduce the issue

Here is the main entry point for the script I'm executing:


import os
from alphabet import ALPHABET_SIZE
from callbacks import (
    ExtendedModelCheckpoint,
    ModelCheckpointWithPeriod,
    PrintModelParametersCallback,
)
from loader import HandwritingDataLoader
from model.handwriting_models import (
    DeepHandwritingSynthesisModel,
)

import tensorflow as tf
import numpy as np
import datetime

from constants import (
    BATCH_SIZE,
    GRADIENT_CLIP_VALUE,
    LEARNING_RATE,
    NUM_BIVARIATE_GAUSSIAN_MIXTURE_COMPONENTS,
    NUM_ATTENTION_GAUSSIAN_COMPONENTS,
    NUM_EPOCH,
)

tf.keras.mixed_precision.set_global_policy('float32')
np.set_printoptions(threshold=10)

# Path logic for saved files
curr_directory = os.path.dirname(os.path.realpath(__file__))
model_name = "handwriting_synthesis"
model_save_dir = f"{curr_directory}/saved/models/{model_name}/"
start_day = datetime.datetime.now().strftime("%Y%m%d")
log_dir = f"{curr_directory}/saved/logs/{model_name}/profile/{start_day}"
debugging_dir = f"{curr_directory}/saved/logs/{model_name}/debug/{start_day}"
os.makedirs(model_save_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)
os.makedirs(debugging_dir, exist_ok=True)
model_save_path = os.path.join(model_save_dir, "best_model.keras")
checkpoint_model_filepath = os.path.join(
    model_save_dir, "model_{epoch:02d}_{val_loss:.2f}.keras"
)
model_save_path_final = os.path.join(model_save_dir, "best_model_final.keras")
epochs_info_path = os.path.join(model_save_dir, "epochs_info.json")

# Hyper parameters
num_mixture_components = NUM_BIVARIATE_GAUSSIAN_MIXTURE_COMPONENTS
num_attention_gaussians = NUM_ATTENTION_GAUSSIAN_COMPONENTS

# Training parameters
desired_epochs = 10_000
model_chkpt_period = 200


# Get the data
data_loader = HandwritingDataLoader()
data_loader.prepare_data()
combined_train_strokes, combined_train_lengths = (
    data_loader.combined_train_strokes,
    data_loader.combined_train_stroke_lengths,
)
combined_train_trans, combined_trans_lengths = (
    data_loader.combined_train_transcriptions,
    data_loader.combined_train_transcription_lengths,
)

if __name__ == "__main__":
    # Preparing the input and target data for training
    x_train = combined_train_strokes
    x_train_len = combined_train_lengths
    y_train = np.zeros_like(x_train)
    y_train[:, :-1, :] = x_train[:, 1:, :]
    y_train_len = x_train_len - 1

    # Preparing the transcription information
    char_seq = combined_train_trans
    char_seq_len = combined_trans_lengths

    best_loss = float("inf")
    batch_size = BATCH_SIZE
    print(f"Debugging info: {debugging_dir}")
    #tf.debugging.set_log_device_placement(True)
    tf.config.set_soft_device_placement(True)
    tf.debugging.enable_check_numerics()
    tf.debugging.experimental.enable_dump_debug_info(
        debugging_dir, tensor_debug_mode="FULL_HEALTH", circular_buffer_size=10_000_000
    )

    dataset = tf.data.Dataset.from_tensor_slices(
        (
            {
                "input_strokes": x_train,
                "input_stroke_lens": x_train_len,
                "input_chars": char_seq,
                "input_char_lens": char_seq_len,
            },
            y_train,
        )
    ).batch(BATCH_SIZE)

    stroke_model = DeepHandwritingSynthesisModel(
        units=400,
        num_layers=3,
        num_mixture_components=num_mixture_components,
        num_chars=ALPHABET_SIZE,
        num_attention_gaussians=num_attention_gaussians,
        gradient_clip_value=GRADIENT_CLIP_VALUE,
    )
    learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=LEARNING_RATE,
        decay_steps=50_000,
        decay_rate=0.96,
        staircase=True,
    )
    stroke_model.compile(
        optimizer=tf.keras.optimizers.RMSprop(
            learning_rate=learning_rate_schedule,
            global_clipnorm=GRADIENT_CLIP_VALUE,
        ),
    )

    callbacks = [
        tf.keras.callbacks.TensorBoard(
            log_dir=log_dir, histogram_freq=1, profile_batch="500,520"
        ),
        ModelCheckpointWithPeriod(model_name, period=200),
        ExtendedModelCheckpoint(model_name),
        PrintModelParametersCallback(),
    ]

    history = stroke_model.fit(dataset, epochs=desired_epochs, callbacks=callbacks)
    stroke_model.save(model_save_path)
    print("Training completed.")
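The target preparation in the script above is a standard next-step (teacher-forcing) shift: y_train[:, :-1, :] = x_train[:, 1:, :], leaving the final target step as zeros. A toy NumPy check of that shift (toy shapes, not the real stroke data):

```python
import numpy as np

# Toy batch: 2 sequences, 4 timesteps, 3 features per step.
x_train = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# Next-step target: y[t] = x[t + 1]; last step stays zero-padded.
y_train = np.zeros_like(x_train)
y_train[:, :-1, :] = x_train[:, 1:, :]

# Each target step now equals the following input step.
assert np.array_equal(y_train[:, 0, :], x_train[:, 1, :])
assert np.all(y_train[:, -1, :] == 0)
```

This is why the script also computes y_train_len = x_train_len - 1: the final padded step carries no prediction target.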

Relevant log output

Debugging info: /root/code/src/saved/logs/handwriting_synthesis/debug/20240617
2024-06-17 12:56:20.745453: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Traceback (most recent call last):
  File "/root/code/src/train_handwriting_synthesis.py", line 115, in <module>
    stroke_model.compile(
  File "/root/code/venv/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/root/code/venv/lib/python3.11/site-packages/tensorflow/dtensor/python/api.py", line 64, in call_with_layout
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 1612, in _array_str_implementation
    return array2string(a, max_line_width, precision, suppress_small, ' ', "")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 736, in array2string
    return _array2string(a, options, separator, prefix)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 513, in wrapper
    return f(self, *args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 539, in _array2string
    format_function = _get_format_function(data, **options)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 472, in _get_format_function
    return formatdict['float']()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/code/venv/lib/python3.11/site-packages/numpy/core/arrayprint.py", line 411, in <lambda>
    'float': lambda: FloatingFormat(
                     ^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded while calling a Python object
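Note that the traceback bottoms out in NumPy's array-printing code (arrayprint.py), not in the model itself. One hypothetical stopgap, not verified against this repro, is to raise CPython's recursion limit so the dump-debug machinery can finish printing while the underlying cause is investigated:

```python
import sys

# Hypothetical workaround sketch (untested against this specific issue):
# the RecursionError fires inside NumPy's array formatting, so raising
# Python's recursion limit from the default can postpone the crash.
print(sys.getrecursionlimit())  # CPython's default limit is 1000
sys.setrecursionlimit(10_000)
```

This only buys headroom; if the recursion is genuinely unbounded it will still crash, just later.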
@Venkat6871 (Contributor)

Hi @johnlarkin1,

  • I reproduced the shared code but am facing a different error. Could you please share a Colab gist with all the dependencies so we can analyze it further?

Thank you!

@Venkat6871 Venkat6871 added the stat:awaiting response Status - Awaiting response from author label Jun 20, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 28, 2024
github-actions bot commented Jul 6, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

@github-actions github-actions bot closed this as completed Jul 6, 2024