
Training produces Out Of Memory error with TF 2.* but works with TF 1.14 #39574

Closed
philiprekers opened this issue May 15, 2020 · 12 comments
Labels: comp:keras (Keras related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.2 (Issues related to TF 2.2), TF 2.5 (Issues related to TF 2.5), type:performance (Performance Issue)

Comments

philiprekers commented May 15, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.4 LTS
  • TensorFlow installed from (source or binary): pip 20.1
  • TensorFlow version (use command below): v2.2.0-rc4-8-g2b96f3662b 2.2.0
  • Python version: Python 3.7.7
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • CUDA/cuDNN version: Cuda compilation tools, release 10.0, V10.0.130
  • GPU model and memory: Nvidia K80 12GB (AWS p2.xlarge Deep Learning AMI (Ubuntu 18.04) Version 28.1)

Describe the current behavior
Training a simple model on a data set:
X shape: (35000, 200, 311)
y shape: (35000, 200, 19)
With TF 2.2.0, training stops right before the start of the first epoch with this error:
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
(Abbreviated for readability; the error indirectly points to an Out Of Memory condition, and the code example below reproduces it.)

Interestingly, the exact same setup, model, and training set trains fine on TF 1.14.

NOTES:

  • I've tried changing the batch size (down to 1) to no avail.
  • Decreasing the training set to ~10,000 examples trains fine.
  • Error also occurs on freshly re-started instances.
  • allow_growth option didn't make a difference (a sketch of enabling it in TF 2.x follows this list).
  • GPU memory is being used by training (confirmed by looking at GPU usage and by showing found GPU devices)
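
For reference, a minimal sketch (not from this thread) of how memory growth, the TF 2.x equivalent of allow_growth, is enabled via tf.config; as noted in the list above, it did not resolve this OOM.

import tensorflow as tf

# Ask TF to allocate GPU memory on demand instead of reserving it all up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)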

Describe the expected behavior
Training above model runs as expected and does not error out.

Standalone code to reproduce the issue
The Colab link can be executed, but testing it there is not meaningful unless it runs on the specified hardware.

The best way to reproduce is to run the following code on the described setup with both TF 1.14 and TF 2.2.0:

import json
import numpy as np
from tensorflow.keras.layers import GRU, Bidirectional, Dense, Masking, Input, TimeDistributed
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

# Synthetic data with the same shapes as the real training set.
y_raw = np.random.randint(19, size=(35000, 200))
y = to_categorical(y_raw, num_classes=19)
X = np.random.rand(35000, 200, 311)

# Masked bidirectional GRU with a per-timestep softmax classifier.
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(200, 311)))
model.add(
    Bidirectional(
        GRU(
            256,
            return_sequences=True,
            unroll=True,
            recurrent_dropout=0.233,
            recurrent_activation="sigmoid",
        )
    )
)
model.add(TimeDistributed(Dense(19, activation="softmax")))

model.compile(
    loss="categorical_crossentropy",
    optimizer="rmsprop",
    metrics=["categorical_accuracy"],
)

model.summary()

# Save the architecture so it can be reloaded later.
architecture_path = "candidate_architecture.json"
model_json = model.to_json()
with open(architecture_path, 'w') as json_file:
    json_file.write(model_json)
print(f"Saved model architecture to {architecture_path}")

# Checkpoint the best weights (by validation loss) after each epoch.
filepath = "candidate_model-{epoch:02d}-{val_categorical_accuracy:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min', save_weights_only=True)
callbacks_list = [checkpoint]

# The error occurs here, while Keras splits off the validation data.
history = model.fit(X, y, epochs=10,
                    validation_split=0.2,
                    batch_size=16,
                    callbacks=callbacks_list)

Other info / logs
Full traceback of error:

2020-05-14 07:16:34.619007: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-14 07:16:34.652784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:16:34.653592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-05-14 07:16:34.653944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-14 07:16:34.656389: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-14 07:16:34.658467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-14 07:16:34.658882: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-14 07:16:34.661237: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-14 07:16:34.662515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-14 07:16:34.666590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-14 07:16:34.666743: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:16:34.667562: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:16:34.668297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-14 07:17:53.690634: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-14 07:17:53.714191: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300050000 Hz
2020-05-14 07:17:53.714472: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3970000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-14 07:17:53.714508: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-14 07:17:53.816549: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.817400: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555a5231f6d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-14 07:17:53.817432: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-05-14 07:17:53.817690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.818466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-05-14 07:17:53.818528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-14 07:17:53.818570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-14 07:17:53.818598: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-14 07:17:53.818625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-14 07:17:53.818677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-14 07:17:53.818722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-14 07:17:53.818752: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-14 07:17:53.818859: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.819637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.820344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-14 07:17:53.820395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-14 07:17:53.822110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 07:17:53.822137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-05-14 07:17:53.822153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-05-14 07:17:53.822277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.823052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-14 07:17:53.823792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10691 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
WARNING:tensorflow:Layer gru will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer gru will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer gru will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
Label count: 19
Model defined.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
masking (Masking)            (None, 200, 311)          0
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 512)          873984
_________________________________________________________________
time_distributed (TimeDistri (None, 200, 19)           9747
=================================================================
Total params: 883,731
Trainable params: 883,731
Non-trainable params: 0
_________________________________________________________________
None
Saved model architecture to candidate_architecture.json
2020-05-14 07:18:19.476160: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 14639392000 exceeds 10% of free system memory.
2020-05-14 07:18:40.240359: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 13.63GiB (rounded to 14639392000)
Current allocation summary follows.
2020-05-14 07:18:40.243374: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2020-05-14 07:18:40.243393: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256):   Total Chunks: 11, Chunks in use: 11. 2.8KiB allocated for chunks. 2.8KiB in use in bin. 116B client-requested in use in bin.
2020-05-14 07:18:40.243409: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243434: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2020-05-14 07:18:40.243451: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2048):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243474: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4096):  Total Chunks: 2, Chunks in use: 2. 12.0KiB allocated for chunks. 12.0KiB in use in bin. 12.0KiB client-requested in use in bin.
2020-05-14 07:18:40.243496: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243568: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16384):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243595: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (32768):         Total Chunks: 3, Chunks in use: 1. 113.5KiB allocated for chunks. 38.0KiB in use in bin. 38.0KiB client-requested in use in bin.
2020-05-14 07:18:40.243630: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (65536):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243655: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (131072):        Total Chunks: 1, Chunks in use: 0. 130.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243698: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (262144):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243721: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (524288):        Total Chunks: 5, Chunks in use: 2. 4.07MiB allocated for chunks. 1.66MiB in use in bin. 1.50MiB client-requested in use in bin.
2020-05-14 07:18:40.243756: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (1048576):       Total Chunks: 2, Chunks in use: 2. 2.68MiB allocated for chunks. 2.68MiB in use in bin. 1.82MiB client-requested in use in bin.
2020-05-14 07:18:40.243773: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (2097152):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243782: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (4194304):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243792: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (8388608):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243825: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (16777216):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243841: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (33554432):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243861: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (67108864):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243893: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243916: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-14 07:18:40.243934: I tensorflow/core/common_runtime/bfc_allocator.cc:957] Bin for 13.63GiB was 256.00MiB, Chunk State:
2020-05-14 07:18:40.243949: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 1048576
2020-05-14 07:18:40.243962: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203be0000 of size 1280 next 1
2020-05-14 07:18:40.243976: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203be0500 of size 256 next 5
2020-05-14 07:18:40.243983: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203be0600 of size 1047040 next 18446744073709551615
2020-05-14 07:18:40.243994: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 2097152
2020-05-14 07:18:40.244014: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0000 of size 256 next 3
2020-05-14 07:18:40.244027: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0100 of size 256 next 4
2020-05-14 07:18:40.244039: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0200 of size 256 next 10
2020-05-14 07:18:40.244058: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee0300 of size 6144 next 14
2020-05-14 07:18:40.244071: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee1b00 of size 6144 next 17
2020-05-14 07:18:40.244086: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee3300 of size 256 next 20
2020-05-14 07:18:40.244103: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee3400 of size 256 next 23
2020-05-14 07:18:40.244121: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ee3500 of size 256 next 24
2020-05-14 07:18:40.244134: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203ee3600 of size 38144 next 15
2020-05-14 07:18:40.244151: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203eecb00 of size 256 next 18
2020-05-14 07:18:40.244164: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203eecc00 of size 256 next 19
2020-05-14 07:18:40.244171: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203eecd00 of size 39168 next 21
2020-05-14 07:18:40.244179: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203ef6600 of size 38912 next 22
2020-05-14 07:18:40.244187: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203effe00 of size 133120 next 11
2020-05-14 07:18:40.244193: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203f20600 of size 256 next 13
2020-05-14 07:18:40.244204: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 1203f20700 of size 692224 next 6
2020-05-14 07:18:40.244223: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1203fc9700 of size 1140992 next 18446744073709551615
2020-05-14 07:18:40.244240: I tensorflow/core/common_runtime/bfc_allocator.cc:970] Next region of size 4194304
2020-05-14 07:18:40.244253: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 12040e0000 of size 256 next 8
2020-05-14 07:18:40.244269: I tensorflow/core/common_runtime/bfc_allocator.cc:990] Free  at 12040e0100 of size 786432 next 9
2020-05-14 07:18:40.244283: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 12041a0100 of size 786432 next 12
2020-05-14 07:18:40.244297: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1204260100 of size 955392 next 16
2020-05-14 07:18:40.244315: I tensorflow/core/common_runtime/bfc_allocator.cc:990] InUse at 1204349500 of size 1665792 next 18446744073709551615
2020-05-14 07:18:40.244332: I tensorflow/core/common_runtime/bfc_allocator.cc:995]      Summary of in-use Chunks by size:
2020-05-14 07:18:40.244348: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 11 Chunks of size 256 totalling 2.8KiB
2020-05-14 07:18:40.244366: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 1280 totalling 1.2KiB
2020-05-14 07:18:40.244380: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 2 Chunks of size 6144 totalling 12.0KiB
2020-05-14 07:18:40.244390: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 38912 totalling 38.0KiB
2020-05-14 07:18:40.244397: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 786432 totalling 768.0KiB
2020-05-14 07:18:40.244411: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 955392 totalling 933.0KiB
2020-05-14 07:18:40.244418: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 1140992 totalling 1.09MiB
2020-05-14 07:18:40.244433: I tensorflow/core/common_runtime/bfc_allocator.cc:998] 1 Chunks of size 1665792 totalling 1.59MiB
2020-05-14 07:18:40.244447: I tensorflow/core/common_runtime/bfc_allocator.cc:1002] Sum Total of in-use chunks: 4.39MiB
2020-05-14 07:18:40.244464: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] total_region_allocated_bytes_: 7340032 memory_limit_: 11210358784 available bytes: 11203018752 curr_region_allocation_bytes_: 8388608
2020-05-14 07:18:40.244482: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] Stats:
Limit:                 11210358784
InUse:                     4603904
MaxInUse:                  6655232
NumAllocs:                      66
MaxAllocSize:              1665792

2020-05-14 07:18:40.244501: W tensorflow/core/common_runtime/bfc_allocator.cc:439] *_____________****_________**************x*__________***********************x**************xxxxxxxxx
Traceback (most recent call last):
  File "modules/training-suite-controller/training-suite-controller/train_open_model.py", line 108, in <module>
    process(training_setup)
  File "modules/training-suite-controller/training-suite-controller/train_open_model.py", line 85, in process
    batch_size=training_setup['batch_size'], callbacks=callbacks_list)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 797, in fit
    shuffle=False))
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1338, in train_validation_split
    functools.partial(_split, indices=train_indices), arrays)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 617, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/nest.py", line 617, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1335, in _split
    return array_ops.gather_v2(t, indices)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4541, in gather_v2
    batch_dims=batch_dims)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 4524, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3755, in gather_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/snapaddy-deep-learning-training-suite-YlZOSDlZ-py3.7/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
philiprekers changed the title from "Training produces out of Memory error with TF 2.2.0 but works with TF 1.14" to "Training produces Out Of Memory error with TF 2.2.0 but works with TF 1.14" on May 15, 2020
mihaimaruseac (Collaborator) commented

Can you compare behavior on 1.15, 2.0 and 2.1 too please?

ravikyram added the comp:keras (Keras related issues), TF 2.2 (Issues related to TF 2.2), and stat:awaiting response (Status - Awaiting response from author) labels on May 18, 2020
philiprekers (Author) commented

Thank you for your response.

I'll compare this on 1.15, 2.0 and 2.1 as well. I might just not be able to do it until next week.

Until then:

  • Does anyone know what could be the cause?
  • Can there be a workaround?
  • Why does the allow_growth option not work?

fused-byte commented

I am also facing the out-of-memory issue. I am using TensorFlow 2.0 and have enabled the allow_growth option.

ravikyram (Contributor) commented

I have tried this in Colab with TF-GPU versions 1.15, 2.0, 2.1, and 2.2 and was able to reproduce the issue. Please find the gist here. Thanks!

ravikyram assigned ymodak and unassigned ravikyram on May 20, 2020
tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on May 22, 2020
philiprekers (Author) commented

@ravikyram thank you for reproducing the problem!

If you need me to provide any further information to this, please don't hesitate to comment.

philiprekers (Author) commented

Just tested the same procedure (with TF 1.15 and TF 2.2) on a larger AWS instance (p3.8xlarge with 4 GPUs).
Used a fix for memory distribution (#30321 (comment)).
Even in this setting, TF 1.15 worked well and TF 2.2 produced the OOM error.

@ravikyram Maybe this helps in assessing the scope of the problem.

ymodak (Contributor) commented Jun 6, 2020

@Philipduerholt Did you update your CUDA version to 10.1 when using TF 2.2?

ymodak added the stat:awaiting response (Status - Awaiting response from author) label on Jun 6, 2020
philiprekers (Author) commented

@ymodak No, the CUDA version is 10.0.130

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Jun 8, 2020
ymodak added the type:performance (Performance Issue) label and removed the type:bug (Bug) label on Jun 8, 2020
ymodak (Contributor) commented Jun 8, 2020

The example code crashes on Google Colab with TF 1.14 as well as 2.2, on both CPU and GPU runtimes.

ymodak assigned tomerk and unassigned ymodak on Jun 8, 2020
gowthamkpr added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) label on Jun 8, 2020
rmothukuru (Contributor) commented

I could reproduce the issue with TensorFlow 2.5 and TensorFlow Nightly (2.6.0-dev20210602). Please find the gist. Thanks!

tomerk commented Jun 2, 2021

Keras's NumPy support involves data copies and is not designed for performant processing of large NumPy datasets. I suggest either using tf.data to construct your inputs directly, or giving from_numpy from TensorFlow I/O (which efficiently turns a NumPy array into a tf.data dataset) a try: https://www.tensorflow.org/io/api_docs/python/tfio/experimental/IODataset#from_numpy
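
As an illustration of the tf.data route, here is a minimal sketch (not from this thread). It assumes the X, y, model and callbacks_list from the repro script above, and replaces validation_split with a plain NumPy slice so Keras never runs the GatherV2 copy shown in the traceback.

import numpy as np
import tensorflow as tf

# Optional: float32 roughly halves the host-side footprint (np.random.rand yields float64).
X = X.astype(np.float32)

# Manual 80/20 split instead of validation_split, so Keras does not gather/copy
# the full array when preparing the validation set.
split = int(0.8 * len(X))
train_ds = (tf.data.Dataset.from_tensor_slices((X[:split], y[:split]))
            .shuffle(1024)
            .batch(16)
            .prefetch(tf.data.experimental.AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((X[split:], y[split:]))
          .batch(16)
          .prefetch(tf.data.experimental.AUTOTUNE))

history = model.fit(train_ds, validation_data=val_ds,
                    epochs=10, callbacks=callbacks_list)

Note that from_tensor_slices still materializes each array as a host tensor once; the from_numpy route linked above is the alternative suggested for building the dataset more efficiently.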

mihaimaruseac changed the title from "Training produces Out Of Memory error with TF 2.2.0 but works with TF 1.14" to "Training produces Out Of Memory error with TF 2.* but works with TF 1.14" on Jun 8, 2021
mihaimaruseac added the TF 2.5 (Issues related to TF 2.5) label on Jun 8, 2021
tomerk closed this as completed on Jun 29, 2021
tomerk commented Jun 29, 2021

Feel free to reopen if the above suggestion does not work for you.
