
Training produces Out Of Memory error with TF 2.* but works with TF 1.14 #39574

Closed
philiprekers opened this issue May 15, 2020 · 12 comments
Labels: comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.2 (Issues related to TF 2.2), TF 2.5 (Issues related to TF 2.5), type:performance (Performance Issue)

Comments

philiprekers commented May 15, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.4 LTS
  • TensorFlow installed from (source or binary): pip 20.1
  • TensorFlow version (use command below): v2.2.0-rc4-8-g2b96f3662b 2.2.0
  • Python version: Python 3.7.7
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • CUDA/cuDNN version: Cuda compilation tools, release 10.0, V10.0.130
  • GPU model and memory: Nvidia K80 12GB (AWS p2.xlarge Deep Learning AMI (Ubuntu 18.04) Version 28.1)

Describe the current behavior
Training a simple model on a data set:
X shape: (35000, 200, 311)
y shape: (35000, 200, 19)
With TF 2.2.0, training stops right before the start of the first epoch with this error:
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
(Abbreviated for readability; the error indirectly points to an Out Of Memory condition, and the code example below reproduces it.)

Interestingly, the exact same setup, model, and training set trains fine on TF 1.14.

NOTES:

  • I've tried changing the batch size (down to 1) to no avail.
  • Decreasing the training set to ~10,000 examples trains fine.
  • Error also occurs on freshly re-started instances.
  • allow_growth option didn't make a difference (a sketch of enabling it in TF 2.x follows this list).
  • GPU memory is being used by training (confirmed by looking at GPU usage and by showing found GPU devices)
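
For reference, a minimal sketch (not from this thread) of how memory growth, the TF 2.x equivalent of allow_growth, is enabled via tf.config; as noted in the list above, it did not resolve this OOM.

import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving it all up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)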

Describe the expected behavior
Training above model runs as expected and does not error out.

Standalone code to reproduce the issue
The Colab link can be executed, but testing it there is not meaningful unless it runs on the specified hardware.

The best way to reproduce is to run the following code on the described setup with both TF 1.14 and TF 2.2.0:

import json
import numpy as np
from tensorflow.keras.layers import GRU, Bidirectional, Dense, Masking, Input, TimeDistributed
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

# Synthetic data with the same shapes as the real training set.
y_raw = np.random.randint(19, size=(35000, 200))
y = to_categorical(y_raw, num_classes=19)
X = np.random.rand(35000, 200, 311)

# Masked bidirectional GRU with a per-timestep softmax classifier.
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(200, 311)))
model.add(
    Bidirectional(
        GRU(
            256,
            return_sequences=True,
            unroll=True,
            recurrent_dropout=0.233,
            recurrent_activation="sigmoid",
        )
    )
)
model.add(TimeDistributed(Dense(19, activation="softmax")))

model.compile(
    loss="categorical_crossentropy",
    optimizer="rmsprop",
    metrics=["categorical_accuracy"],
)

model.summary()

# Save the architecture so it can be reloaded later.
architecture_path = "candidate_architecture.json"
model_json = model.to_json()
with open(architecture_path, 'w') as json_file:
    json_file.write(model_json)
print(f"Saved model architecture to {architecture_path}")

# Checkpoint the best weights (by validation loss) after each epoch.
filepath = "candidate_model-{epoch:02d}-{val_categorical_accuracy:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min', save_weights_only=True)
callbacks_list = [checkpoint]

# The error occurs here, while Keras splits off the validation data.
history = model.fit(X, y, epochs=10,
                    validation_split=0.2,
                    batch_size=16,
                    callbacks=callbacks_list)

Other info / logs
Full traceback of error:

2020-05-14 07:16:34.619007: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-14 07:16:34.652784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:16:34.653592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-05-14 07:16:34.653944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-14 07:16:34.656389: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-14 07:16:34.658467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-14 07:16:34.658882: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-14 07:16:34.661237: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-14 07:16:34.662515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-14 07:16:34.666590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-14 07:16:34.666743: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:16:34.667562: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:16:34.668297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-14 07:17:53.690634: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-14 07:17:53.714191: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300050000 Hz
2020-05-14 07:17:53.714472: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3970000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-14 07:17:53.714508: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-14 07:17:53.816549: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.817400: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555a5231f6d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-14 07:17:53.817432: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-05-14 07:17:53.817690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.818466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-05-14 07:17:53.818528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-14 07:17:53.818570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-14 07:17:53.818598: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-14 07:17:53.818625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-14 07:17:53.818677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-14 07:17:53.818722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-14 07:17:53.818752: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-14 07:17:53.818859: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.819637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.820344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-14 07:17:53.820395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-14 07:17:53.822110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 07:17:53.822137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-05-14 07:17:53.822153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-05-14 07:17:53.822277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.823052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.823792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10691 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
WARNING:tensorflow:Layer gru will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer gru will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer gru will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
Label count: 19
Model defined.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
masking (Masking)            (None, 200, 311)          0
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 512)          873984
_________________________________________________________________
time_distributed (TimeDistri (None, 200, 19)           9747
=================================================================
Total params: 883,731
Trainable params: 883,731
Non-trainable params: 0
_________________________________________________________________
None
Saved model architecture to candidate_architecture.json
2020-05-14 07:18:19.476160: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 14639392000 exceeds 10% of free system memory.
2020-05-14 07:18:40.240359: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 13.63GiB (rounded to 14639392000)
Current allocation summary follows.
2020-05-14 07:18:40.243374: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2020-05-14 07:18:40.243393: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256):   Total Chunks: 11, Chunks in use: 11. 2.8KiB allocated for chunks. 2.8KiB in use in bin. 116B client-requested in use in bin.
2020-05-14 07:18:40.243409: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243434: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2020-05-14 07:18:40.243451: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2048):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243474: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4096):  Total Chunks: 2, Chunks in use: 2. 12.0KiB allocated for chunks. 12.0KiB in use in bin. 12.0KiB client-requested in use in bin.
2020-05-14 07:18:40.243496: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243568: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16384):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243595: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (32768):         Total Chunks: 3, Chunks in use: 1. 113.5KiB allocated for chunks. 38.0KiB in use in bin. 38.0KiB client-requested in use in bin.
2020-05-14 07:18:40.243630: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (65536):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243655: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (131072):        Total Chunks: 1, Chunks in use: 0. 130.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243698: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (262144):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243721: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (524288):        Total Chunks: 5, Chunks in use: 2. 4.07MiB allocated for chunks. 1.66MiB in use in bin. 1.50MiB client-requested in use in bin.
2020-05-14 07:18:40.243756: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1048576):       Total Chunks: 2, Chunks in use: 2. 2.68MiB allocated for chunks. 2.68MiB in use in bin. 1.82MiB client-requested in use in bin.
2020-05-14 07:18:40.243773: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2097152):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243782: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4194304):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243792: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8388608):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243825: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16777216):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243841: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (33554432):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243861: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (67108864):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243893: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243916: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243934: I tensorflow/core/common_runtime/bfc_allocator.cc:957] Bin for 13.63GiB was 256.00MiB, Chunk State:
2020-05-14 07:18:40.243949: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 1048576
2020-05-14 07:18:40.243962: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203be0000 of size 1280 next 1
2020-05-14 07:18:40.243976: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203be0500 of size 256 next 5
2020-05-14 07:18:40.243983: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203be0600 of size 1047040 next 18446744073709551615
2020-05-14 07:18:40.243994: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 2097152
2020-05-14 07:18:40.244014: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0000 of size 256 next 3
2020-05-14 07:18:40.244027: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0100 of size 256 next 4
2020-05-14 07:18:40.244039: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0200 of size 256 next 10
2020-05-14 07:18:40.244058: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0300 of size 6144 next 14
2020-05-14 07:18:40.244071: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee1b00 of size 6144 next 17
2020-05-14 07:18:40.244086: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee3300 of size 256 next 20
2020-05-14 07:18:40.244103: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee3400 of size 256 next 23
2020-05-14 07:18:40.244121: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee3500 of size 256 next 24
2020-05-14 07:18:40.244134: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203ee3600 of size 38144 next 15
2020-05-14 07:18:40.244151: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203eecb00 of size 256 next 18
2020-05-14 07:18:40.244164: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203eecc00 of size 256 next 19
2020-05-14 07:18:40.244171: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203eecd00 of size 39168 next 21
2020-05-14 07:18:40.244179: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ef6600 of size 38912 next 22
2020-05-14 07:18:40.244187: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203effe00 of size 133120 next 11
2020-05-14 07:18:40.244193: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203f20600 of size 256 next 13
2020-05-14 07:18:40.244204: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203f20700 of size 692224 next 6
2020-05-14 07:18:40.244223: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203fc9700 of size 1140992 next 18446744073709551615
2020-05-14 07:18:40.244240: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 4194304
2020-05-14 07:18:40.244253: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 12040e0000 of size 256 next 8
2020-05-14 07:18:40.244269: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 12040e0100 of size 786432 next 9
2020-05-14 07:18:40.244283: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 12041a0100 of size 786432 next 12
2020-05-14 07:18:40.244297: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1204260100 of size 955392 next 16
2020-05-14 07:18:40.244315: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1204349500 of size 1665792 next 18446744073709551615
2020-05-14 07:18:40.244332: I tensorflow/core/common_runtime/bfc_allocator.cc:995]      Summary of in-use Chunks by size:
2020-05-14 07:18:40.244348: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 11 Chunks of size 256 totalling 2.8KiB
2020-05-14 07:18:40.244366: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 1280 totalling 1.2KiB
2020-05-14 07:18:40.244380: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 2 Chunks of size 6144 totalling 12.0KiB
2020-05-14 07:18:40.244390: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 38912 totalling 38.0KiB
2020-05-14 07:18:40.244397: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 786432 totalling 768.0KiB
2020-05-14 07:18:40.244411: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 955392 totalling 933.0KiB
2020-05-14 07:18:40.244418: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 1140992 totalling 1.09MiB
2020-05-14 07:18:40.244433: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 1665792 totalling 1.59MiB
2020-05-14 07:18:40.244447: I tensorflow/core/common_runtime/bfc_allocator.cc:1002] Sum Total of in-use chunks: 4.39MiB
2020-05-14 07:18:40.244464: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] total_region_allocated_bytes_: 7340032 memory_limit_: 11210358784 available bytes: 11203018752 curr_region_allocation_bytes_: 8388608
2020-05-14 07:18:40.244482: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit:                 11210358784
InUse:                     4603904
MaxInUse:                  6655232
NumAllocs:                      66
MaxAllocSize:              1665792

2020-05-14 07:18:40.244501: W tensorflow/core/common_runtime/bfc_allocator.cc:439] *_____________****_________**************x*__________***********************x**************xxxxxxxxx
Traceback (most recent call last):
  File "modules/training-suite-controller/training-suite-controller/train_open_model.py", line 108, in <module>
    process(training_setup)
  File "modules/training-suite-controller/training-suite-controller/train_open_model.py", line 85, in process
    batch_size=training_setup['batch_size'], callbacks=callbacks_list)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 797, in fit
    shuffle=False))
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1338, in train_validation_split
    functools.partial(_split, indices=train_indices), arrays)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 617, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 617, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1335, in _split
    return array_ops.gather_v2(t, indices)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4541, in gather_v2
    batch_dims=batch_dims)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4524, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3755, in gather_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
philiprekers changed the title from "Training produces out of Memory error with TF 2.2.0 but works with TF 1.14" to "Training produces Out Of Memory error with TF 2.2.0 but works with TF 1.14" on May 15, 2020
mihaimaruseac (Collaborator) commented

Can you compare behavior on 1.15, 2.0 and 2.1 too please?

ravikyram added the comp:keras (Keras related issues), TF 2.2 (Issues related to TF 2.2), and stat:awaiting response (Status - Awaiting response from author) labels on May 18, 2020
philiprekers (Author) commented

Thank you for your response.

I'll compare this on 1.15, 2.0 and 2.1 as well. I might just not be able to do it until next week.

Until then:

  • Does anyone know what could be the cause?
  • Can there be a workaround?
  • Why does the allow_growth option not work?

fused-byte commented

I am also facing the out-of-memory issue. I am using TensorFlow 2.0 and have enabled the allow_growth option.

ravikyram (Contributor) commented

I have tried this in Colab with TF-GPU versions 1.15, 2.0, 2.1, and 2.2 and was able to reproduce the issue. Please find the gist here. Thanks!

ravikyram assigned ymodak and unassigned ravikyram on May 20, 2020
tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on May 22, 2020
philiprekers (Author) commented

@ravikyram thank you for reproducing the problem!

If you need me to provide any further information to this, please don't hesitate to comment.

philiprekers (Author) commented

Just tested the same procedure (with TF 1.15 and TF 2.2) on a larger AWS instance (p3.8xlarge with 4 GPUs).
Used a fix for memory distribution (#30321 (comment)).
Even in this setting, TF 1.15 worked well and TF 2.2 produced the OOM error.

@ravikyram Maybe this helps in assessing the scope of the problem.

ymodak (Contributor) commented Jun 6, 2020

@Philipduerholt Did you update your CUDA version to 10.1 when using TF 2.2?

ymodak added the stat:awaiting response (Status - Awaiting response from author) label on Jun 6, 2020
philiprekers (Author) commented

@ymodak No, the CUDA version is 10.0.130

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Jun 8, 2020
ymodak added the type:performance (Performance Issue) label and removed the type:bug (Bug) label on Jun 8, 2020
ymodak (Contributor) commented Jun 8, 2020

The example code crashes on Google Colab with TF 1.14 as well as 2.2, on both CPU and GPU runtimes.

ymodak assigned tomerk and unassigned ymodak on Jun 8, 2020
gowthamkpr added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Jun 8, 2020
rmothukuru (Contributor) commented

I could reproduce the issue with TensorFlow 2.5 and TensorFlow Nightly (2.6.0-dev20210602). Please find the gist. Thanks!

tomerk commented Jun 2, 2021

Keras's NumPy support involves data copies and is not designed for performant processing of large NumPy datasets. I suggest either using tf.data to construct your inputs directly, or giving from_numpy from TensorFlow I/O (which efficiently turns a NumPy array into a tf.data dataset) a try: https://www.tensorflow.org/io/api_docs/python/tfio/experimental/IODataset#from_numpy
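
As an illustration of the tf.data route, here is a minimal sketch (not from this thread). It assumes the X, y, model and callbacks_list from the repro script above, and replaces validation_split with a plain NumPy slice so Keras never runs the GatherV2 copy shown in the traceback.

import numpy as np
import tensorflow as tf

# Optional: float32 roughly halves the host-side footprint (np.random.rand yields float64).
X = X.astype(np.float32)

# Manual 80/20 split instead of validation_split, so Keras does not gather/copy
# the full array when preparing the validation set.
split = int(0.8 * len(X))
train_ds = (tf.data.Dataset.from_tensor_slices((X[:split], y[:split]))
            .shuffle(1024)
            .batch(16)
            .prefetch(tf.data.experimental.AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((X[split:], y[split:]))
          .batch(16)
          .prefetch(tf.data.experimental.AUTOTUNE))

history = model.fit(train_ds, validation_data=val_ds,
                    epochs=10, callbacks=callbacks_list)

Note that from_tensor_slices still materializes each array as a host tensor once; the from_numpy route linked above is the alternative suggested for building the dataset more efficiently.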

mihaimaruseac changed the title from "Training produces Out Of Memory error with TF 2.2.0 but works with TF 1.14" to "Training produces Out Of Memory error with TF 2.* but works with TF 1.14" on Jun 8, 2021
mihaimaruseac added the TF 2.5 (Issues related to TF 2.5) label on Jun 8, 2021
tomerk closed this as completed on Jun 29, 2021
tomerk commented Jun 29, 2021

Feel free to reopen if the above suggestion does not work for you.
