LSTM layer backprop error in TensorFlow

Hello everyone,
I’m trying to train this sequential model with an LSTM layer, but I’m getting errors. I’ve tried reducing the number of cells, but I still get the same error. The error doesn’t occur with a standard SimpleRNN layer; it only appears with the LSTM and GRU layers.
Does anyone have an idea how to remedy this?
PS: I’m using TensorFlow 2.16.1 and CUDA 12.0.
Thank you for your feedback.
Best regards,

from tensorflow import keras

# Definition
sequence_length = 120
features_len = 6

## Model
model1 = keras.models.Sequential()
model1.add(keras.layers.InputLayer(input_shape=(sequence_length, features_len)))
model1.add(keras.layers.LSTM(32, return_sequences=False))
model1.add(keras.layers.Dense(120))
model1.compile(optimizer="rmsprop", loss="mse")
model1.summary()

history1 = model1.fit(dataset_train,
                      epochs=10,
                      validation_data=dataset_val)
Epoch 1/10

2024-05-13 15:01:12.015551: E external/local_xla/xla/stream_executor/dnn.cc:1158] <unknown cudnn status: 14>
in external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc(2394): 'cudnnRNNBackwardData_v8( cudnn.handle(), rnn_desc.handle(), reinterpret_cast<const int*>(seq_lengths_data.opaque()), output_desc.data_handle(), output_data.opaque(), output_backprop_data.opaque(), input_desc.data_handle(), input_backprop_data->opaque(), input_h_desc.handle(), input_h_data.opaque(), output_h_backprop_data.opaque(), input_h_backprop_data->opaque(), input_c_desc.handle(), input_c_data.opaque(), output_c_backprop_data.opaque(), input_c_backprop_data->opaque(), rnn_desc.ParamsSizeInBytes(), params.opaque(), workspace.size(), workspace.opaque(), reserve_space_data->size(), reserve_space_data->opaque())'
2024-05-13 15:01:12.015582: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at cudnn_rnn_ops.cc:2192 : INTERNAL: Failed to call DoRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 6, 32, 1, 120, 256, 32] 
2024-05-13 15:01:12.015596: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INTERNAL: Failed to call DoRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 6, 32, 1, 120, 256, 32] 
	 [[{{function_node __inference_one_step_on_data_7252}}{{node gradient_tape/sequential_4_1/lstm_4_1/CudnnRNNBackpropV3}}]]

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
Cell In[24], line 1
----> 1 history1 = model1.fit(dataset_train,
      2                     epochs=10,
      3                     validation_data=dataset_val
      4                      )

File ~/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # `keras.config.disable_traceback_filtering()`
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     51 try:
     52   ctx.ensure_initialized()
---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     54                                       inputs, attrs, num_outputs)
     55 except core._NotOkStatusException as e:
     56   if name is not None:

InternalError: Graph execution error:

Detected at node gradient_tape/sequential_4_1/lstm_4_1/CudnnRNNBackpropV3 defined at (most recent call last):
  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/runpy.py", line 197, in _run_module_as_main

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/runpy.py", line 87, in _run_code

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel_launcher.py", line 17, in <module>

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/traitlets/config/application.py", line 992, in launch_instance

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/kernelapp.py", line 701, in start

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/tornado/platform/asyncio.py", line 195, in start

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/asyncio/base_events.py", line 601, in run_forever

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/asyncio/events.py", line 80, in _run

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 534, in dispatch_queue

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 523, in process_one

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 429, in dispatch_shell

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 767, in execute_request

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/ipkernel.py", line 429, in do_execute

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3024, in run_cell

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3079, in _run_cell

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3284, in run_cell_async

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3466, in run_ast_nodes

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code

  File "/tmp/ipykernel_12737/1001880985.py", line 1, in <module>

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/keras/src/backend/tensorflow/trainer.py", line 314, in fit

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/keras/src/backend/tensorflow/trainer.py", line 117, in one_step_on_iterator

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/keras/src/backend/tensorflow/trainer.py", line 104, in one_step_on_data

  File "/home/otakagle/anaconda3/envs/tf1-gpu/lib/python3.9/site-packages/keras/src/backend/tensorflow/trainer.py", line 66, in train_step

Failed to call DoRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 6, 32, 1, 120, 256, 32] 
	 [[{{node gradient_tape/sequential_4_1/lstm_4_1/CudnnRNNBackpropV3}}]] [Op:__inference_one_step_on_iterator_7283]

Hi @Glenn_01, this error is usually due to one of two things: the GPU running out of memory, or a CUDA version that does not match the TensorFlow version you are using. To confirm whether it is a GPU-related issue, please try running the code on the CPU. If it works fine on the CPU but fails on the GPU, please try reducing the batch_size and installing a CUDA version that matches your TensorFlow version.
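
To make that CPU check easy, here is a minimal sketch (assuming the same model-building code from the question): hiding the GPUs from TensorFlow before the model is built keeps the LSTM on the plain CPU kernel instead of the CudnnRNN op.

import tensorflow as tf
from tensorflow import keras

# Hide all GPUs *before* building the model, so no cuDNN kernels are used.
tf.config.set_visible_devices([], "GPU")

sequence_length = 120
features_len = 6

model1 = keras.models.Sequential()
model1.add(keras.layers.InputLayer(input_shape=(sequence_length, features_len)))
model1.add(keras.layers.LSTM(32, return_sequences=False))
model1.add(keras.layers.Dense(120))
model1.compile(optimizer="rmsprop", loss="mse")

# dataset_train / dataset_val are the datasets from the original post.
# history1 = model1.fit(dataset_train, epochs=10, validation_data=dataset_val)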

As per the tested build configurations, TensorFlow 2.16.1 supports CUDA 12.3, but I can see that you are using 12.0. Please try upgrading CUDA to 12.3. Thank you.
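
To double-check which CUDA/cuDNN versions your TensorFlow wheel was built against, you can query the build info; this is just a small diagnostic sketch, independent of the model:

import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TF version:      ", tf.__version__)
print("Built for CUDA:  ", build.get("cuda_version"))
print("Built for cuDNN: ", build.get("cudnn_version"))
print("GPUs visible:    ", tf.config.list_physical_devices("GPU"))

If the reported CUDA version differs from what is installed on the system, that mismatch is a likely culprit.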

Following up on @Kiran_Sai_Ramineni’s reply, could you please try reducing the batch size?
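
The error log shows batch_size 256. Assuming dataset_train and dataset_val are already-batched tf.data.Dataset objects, a quick way to try a smaller batch (64 here is just an arbitrary example value) is:

smaller_batch = 64  # example value, tune as needed

# unbatch() then batch() re-packs the existing dataset into smaller batches
dataset_train_small = dataset_train.unbatch().batch(smaller_batch)
dataset_val_small = dataset_val.unbatch().batch(smaller_batch)

history1 = model1.fit(dataset_train_small,
                      epochs=10,
                      validation_data=dataset_val_small)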

In my project experience, when the GPU runs out of memory you normally get a "resource exhausted" error instead. This error is more likely caused by an incompatibility between the Keras LSTM layer implementation and the cuDNN backend.
Downgrading the TF version might help.
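
If downgrading is not convenient, another workaround worth trying (a sketch, not an official fix) is to keep TF 2.16.1 but force Keras onto the generic, non-cuDNN LSTM kernel. Keras only dispatches to the fused cuDNN implementation when its requirements are met (default tanh/sigmoid activations, recurrent_dropout == 0, unroll=False, use_bias=True, etc.), so breaking one of them, for example a tiny non-zero recurrent_dropout, avoids the failing CudnnRNNBackpropV3 op at the cost of slower training:

from tensorflow import keras

sequence_length = 120
features_len = 6

model1 = keras.models.Sequential()
model1.add(keras.layers.InputLayer(input_shape=(sequence_length, features_len)))
# recurrent_dropout > 0 disables the cuDNN fast path; a tiny value keeps
# the regularization effect negligible while forcing the generic kernel.
model1.add(keras.layers.LSTM(32, return_sequences=False, recurrent_dropout=1e-6))
model1.add(keras.layers.Dense(120))
model1.compile(optimizer="rmsprop", loss="mse")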