
tf.keras.model fit() significantly slower when using weighted validation data in comparison to tf2.1.0 #39588

Closed
sirvincent opened this issue May 15, 2020 · 16 comments
Labels
comp:keras (Keras related issues) · regression issue (to spot regression issues in latest version) · stat:awaiting tensorflower (awaiting response from tensorflower) · TF 2.2 (issues related to TF 2.2) · type:performance (Performance Issue)

Comments

@sirvincent

Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    ArchLinux & Ubuntu 18.04 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    No
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    v2.2.0-rc4-8-g2b96f3662b 2.2.0
    (compared to: v2.1.0-rc2-17-ge5bf8de 2.1.0)
  • Python version:
    3.7.5

The ArchLinux machine runs on CPU
The Ubuntu machine runs on GPU with:

  • CUDA/cuDNN version:
    10.1.243
  • GPU model and memory:
    GeForce GTX 1080 with 7126 MB memory

Describe the current behavior
Training a simple tf.keras.Model multilayer perceptron with a call to .fit() whose validation_data includes sample weights results in a significantly slower fit() than in TensorFlow 2.1.0 with the exact same code.

Describe the expected behavior
Similar performance between TensorFlow 2.1.0 and 2.2.0 when training a tf.keras.Model with a weighted validation data set.

Standalone code to reproduce the issue
Package requirements for the code snippet, using Python 3.7.5:

numpy = "==1.18.2"
tensorflow = "==2.2.0"
tensorflow-datasets = "==3.1.0"

import typing

import numpy as np
from tensorflow import keras
import tensorflow_datasets as tfds


def build_neural_network(input_dimension: int, number_of_classes: int, compile_options: dict):
    model = keras.Sequential()
    model.add(keras.layers.Dense(112, activation='relu', input_dim=input_dimension))
    model.add(keras.layers.Dense(112, activation='relu'))
    model.add(keras.layers.Dense(number_of_classes, activation='softmax'))

    model.compile(**compile_options)

    model.summary()

    return model

def load_in_images_and_labels_and_reshape(dataset) -> typing.Tuple[np.ndarray, np.ndarray]:
    images = []
    labels = []
    for image, label in tfds.as_numpy(dataset):
        new_image_shape = image.shape[0] * image.shape[1]
        images.append(image.reshape(new_image_shape))
        labels.append(label)

    return np.array(images), np.array(labels)


def train_neural_network(is_random_weighing: bool):
    dataset_train      = tfds.load('emnist', split='train', as_supervised=True)
    dataset_validation = tfds.load('emnist', split='test', as_supervised=True)

    train_images, train_labels           = load_in_images_and_labels_and_reshape(dataset_train)
    validation_images, validation_labels = load_in_images_and_labels_and_reshape(dataset_validation)
    train_labels      = keras.utils.to_categorical(train_labels)
    validation_labels = keras.utils.to_categorical(validation_labels)

    print("load")
    compile_options =  {
        "loss": "categorical_crossentropy",
        "optimizer": "adam",
        "metrics": ["categorical_accuracy"],
        "weighted_metrics": ["categorical_accuracy"]
    }
    network = build_neural_network(train_images.shape[-1], len(train_labels[0]), compile_options)

    fit_options = {    
        "batch_size": 2048,
        "epochs": 10,
        "verbose": 1,
        "workers": 1
    }
    if is_random_weighing:
        # Per-sample weights for the validation set; this is what triggers the slowdown in TF 2.2.
        random_weights = np.random.rand(len(validation_images))
        validation_data_tuple = (validation_images, validation_labels, random_weights)
    else:
        validation_data_tuple = (validation_images, validation_labels)
    history = network.fit(train_images, train_labels, validation_data=validation_data_tuple, **fit_options)


if __name__ == "__main__":
    is_random_weighing = True
    train_neural_network(is_random_weighing)

Other info / logs
Running the above code snippet on the ArchLinux machine (on CPU) takes roughly 19 seconds per epoch. When the same code is run with TensorFlow 2.1.0 it takes roughly 5 seconds per epoch. When the weighting of the validation dataset is turned off in TensorFlow 2.2.0 (is_random_weighing = False), performance becomes similar to TensorFlow 2.1.0: roughly 5 seconds per epoch.
The slowdown is also seen on the Ubuntu machine, run on GPU, where (likely due to the different hardware) TF 2.2.0 is about 7 times as slow as TF 2.1.0.

The effect was not seen (but maybe it was not measurable) when using mnist in place of emnist.

The issue seems related to #39039, in which a comment by @romanovzky suggested that the slowdown might be due to the validation data or validation split, although that issue is in the context of comparing a TensorFlow estimator to Keras.

This issue also seems related to #39434, in which a significant performance drop from TF 2.1 to TF 2.2 is also reported.

It seems like another small puzzle piece in a larger puzzle (or I am doing something simple wrong on both machines).

@sirvincent sirvincent added the type:performance Performance Issue label May 15, 2020
@sirvincent sirvincent changed the title tf.keras.model fit() significantly slower when using using weighted validation data in comparison to tf2.1.0 tf.keras.model fit() significantly slower when using weighted validation data in comparison to tf2.1.0 May 15, 2020
@amahendrakar
Contributor

Was able to reproduce the issue. TF v2.2 and TF-nightly take more time for each epoch when compared to TF v2.1. Please find the attached gist. Thanks!

@amahendrakar amahendrakar added comp:keras Keras related issues TF 2.2 Issues related to TF 2.2 labels May 18, 2020
@jvishnuvardhan jvishnuvardhan added the regression issue To spot regression issues in latest version label May 19, 2020
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 19, 2020
@jarednielsen
Contributor

Any update on these performance problems in TF 2.2? This issue is one of many; see also #39665 and #38675 and #39574 and #39434.

What is the status?

@edwardyehuang
Contributor

TensorFlow 2.2 takes much more time than 2.1/2.0 to start training after keras.fit is called.

2020-06-01 10:16:44.991459: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-01 10:16:46.235945: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2020-06-01 10:16:46.328871: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-06-01 10:16:48.148004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-01 10:23:36.473814: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.

It gets stuck for about 7 minutes before training starts.
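
For reference, a minimal sketch of how this time-to-first-batch could be measured, assuming a standard Keras callback attached to the fit() call from the repro script above (the callback name is mine):

import time

from tensorflow import keras


class TimeToFirstBatch(keras.callbacks.Callback):
    """Reports how long fit() spends before the first training batch runs."""

    def on_train_begin(self, logs=None):
        self._start = time.perf_counter()
        self._reported = False

    def on_train_batch_end(self, batch, logs=None):
        if not self._reported:
            self._reported = True
            elapsed = time.perf_counter() - self._start
            print(f"fit() start to first training batch: {elapsed:.1f} s")


# e.g. network.fit(train_images, train_labels,
#                  validation_data=validation_data_tuple,
#                  callbacks=[TimeToFirstBatch()], **fit_options)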

@edwardyehuang
Contributor

Log from 2.1

INFO:tensorflow:batch_all_reduce: 436 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0531 20:55:03.956965 139684401002304 cross_device_ops.py:760] batch_all_reduce: 436 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 436 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0531 20:55:14.695299 139684401002304 cross_device_ops.py:760] batch_all_reduce: 436 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2020-05-31 20:55:39.932592: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-31 20:55:41.811100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-31 20:55:48.718710: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.

@jarednielsen
Contributor

Interesting that the random weighting causes the performance slowdown. In my case, turning on dropout layers (even with dropout_prob=0) causes a performance slowdown. Could it be something in the TensorFlow randomness modules?
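
For comparison, a minimal sketch of that variant, assuming the MLP from the repro script above with a zero-rate Dropout layer added (rate=0.0 is my reading of "dropout_prob=0"):

from tensorflow import keras


def build_neural_network_with_dropout(input_dimension: int, number_of_classes: int, compile_options: dict):
    # Same MLP as build_neural_network() in the repro script, with a
    # Dropout layer at rate 0.0 inserted (so no units are actually dropped).
    model = keras.Sequential([
        keras.layers.Dense(112, activation='relu', input_dim=input_dimension),
        keras.layers.Dropout(0.0),
        keras.layers.Dense(112, activation='relu'),
        keras.layers.Dense(number_of_classes, activation='softmax'),
    ])
    model.compile(**compile_options)
    return model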

@sirvincent
Author
sirvincent commented Jun 9, 2020

What is the status of this issue? Is someone actively looking into it? If not, is there any estimate of when someone might look at this issue?
It is understandable that handling issues might take a while, especially given the huge number of issues TensorFlow receives!

Is there something I can do to help?

Currently this issue prevents us from updating to TensorFlow 2.2 and thus from updating to Python 3.8. Luckily, neither Python 3.8 nor TensorFlow 2.2 is a requirement yet.

@goldiegadde
Contributor

@sirvincent thanks for reporting the issue. A fix was submitted in 1d2d05f and is available in the latest nightly.

@goldiegadde goldiegadde added this to In progress in TensorFlow 2.3.0 Jun 17, 2020
@sirvincent
Author

Thanks @goldiegadde, I have tested tf-nightly 2.3.0.dev20200619 and the issue seems to be fixed.
Thank you!

TensorFlow 2.3.0 automation moved this from In progress to Done Jun 19, 2020
@romanovzky
romanovzky commented Aug 2, 2020

This regression is not completely fixed in 2.3.0. It seems that, for whatever reason, the first epoch takes a long time to start, and the first validation step is also very slow. From the second epoch onward, the epoch times are comparable.
This can be reproduced in this colab. EDIT: Colab link removed, as it was pointing to another colab; I have lost (probably deleted) the original one.

@jvishnuvardhan
Contributor

@romanovzky Can you please open a new issue with the gist (you already have one above). Thanks!

@MarioTro

I had the same issue and was able to circumvent it by converting my weights numpy-array into a pandas series. Training now starts immediately and I do not have to wait anymore.
pd.Series(my_weights)
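
A minimal sketch of how that workaround would slot into the repro script above, assuming pandas is installed and reusing the variables from train_neural_network():

import numpy as np
import pandas as pd

# Wrap the per-sample validation weights in a pandas Series instead of
# passing the raw NumPy array; commenters report this avoids the long
# stall at the start of training on TF 2.2.
random_weights = np.random.rand(len(validation_images))
validation_data_tuple = (validation_images, validation_labels, pd.Series(random_weights))
history = network.fit(train_images, train_labels, validation_data=validation_data_tuple, **fit_options)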

@meni432
meni432 commented Feb 18, 2021

I had the same issue and was able to circumvent it by converting my weights numpy-array into a pandas series. Training now starts immediately and I do not have to wait anymore.
pd.Series(my_weights)

This works for me, but how does it actually work? Doesn't the API only accept NumPy arrays?

@romanovzky

The problem is also fixed if you use a generator (keras Sequence), which is what I have been using.
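
A minimal sketch of that approach, assuming a keras.utils.Sequence (class name is mine) that yields (inputs, targets, sample_weights) batches and is passed as validation_data in place of the tuple from the repro script:

import numpy as np
from tensorflow import keras


class WeightedValidationSequence(keras.utils.Sequence):
    """Yields (inputs, targets, sample_weights) batches for validation."""

    def __init__(self, images, labels, weights, batch_size=2048):
        self.images = images
        self.labels = labels
        self.weights = weights
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, index):
        start = index * self.batch_size
        stop = start + self.batch_size
        return (self.images[start:stop],
                self.labels[start:stop],
                self.weights[start:stop])


# e.g. validation_data = WeightedValidationSequence(
#          validation_images, validation_labels, random_weights)
#      history = network.fit(train_images, train_labels,
#                            validation_data=validation_data, **fit_options)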

@LucaCappelletti94

I had the same issue and was able to circumvent it by converting my weights numpy-array into a pandas series. Training now starts immediately and I do not have to wait anymore.
pd.Series(my_weights)

If this works, I call it sorcery. Thank you!

@Brentbin
Brentbin commented Nov 3, 2021

I had the same issue and was able to circumvent it by converting my weights numpy-array into a pandas series. Training now starts immediately and I do not have to wait anymore. pd.Series(my_weights)

It works for me.

@nershman
nershman commented Mar 5, 2022

New issue has been opened recently: #48965
