
Is there a way to sync the journal of data service with checkpoint? #63079

Open
runitao opened this issue Feb 28, 2024 · 0 comments
Labels
2.6.0 · comp:data (tf.data related issues) · stat:awaiting tensorflower (Status - Awaiting response from tensorflower) · type:support (Support issues)

Comments

runitao commented Feb 28, 2024

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

source

TensorFlow version

tf 2.6

Custom code

No

OS platform and distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

In my app I enable fault_tolerant_mode for the tf.data service dispatcher, and set save_checkpoint_steps=5000 in tf.train.MonitoredTrainingSession.
For example, if the app is killed at global step 9500 and I then resume it from the checkpoint taken at step 5000, the data service keeps consuming data from where it left off at step 9500, so roughly 4500 steps' worth of data are skipped.

Is there a way to sync the data service journal with the checkpoint, or at least to reduce the amount of skipped data?
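
For context, outside of the data service a tf.data iterator's position can be saved together with the model via tf.train.Checkpoint, which is roughly the behaviour I am after. A minimal sketch (assuming eager execution, no tf.data service, and a made-up checkpoint path) looks like:

import tensorflow as tf

dataset = tf.data.Dataset.range(100)
iterator = iter(dataset)

# Track the iterator alongside the training step so its position is saved
# in the same checkpoint as the model state.
ckpt = tf.train.Checkpoint(step=tf.Variable(0, dtype=tf.int64), iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, "/tmp/ckpt_with_iterator", max_to_keep=3)

for _ in range(20):
    _ = next(iterator)              # consume one element per "step"
    ckpt.step.assign_add(1)
    if int(ckpt.step) % 5 == 0:
        manager.save()              # iterator position saved with the step

# After a restart, restoring the checkpoint also restores the iterator
# position, so training resumes from the element it stopped at.
ckpt.restore(manager.latest_checkpoint)

With the data service, however, the consumption state lives in the dispatcher's journal rather than in the client-side iterator, so this approach does not seem to apply directly, which is why I am asking about syncing the journal.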

Standalone code to reproduce the issue

# dispatcher server snippet
dispatcher_server = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(
        fault_tolerant_mode=True,
        work_dir=...
    ))

# train worker: route the dataset through the data service dispatcher
dataset = dataset.apply(
    tf.data.experimental.service.distribute(service=dispatcher_server.target, ...))

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       save_checkpoint_steps=5000, ...) as mon_sess:
    ...  # training loop

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:support Support issues label Feb 28, 2024
@SuryanarayanaY SuryanarayanaY added comp:data tf.data related issues 2.6.0 labels Mar 1, 2024
@sachinprasadhs sachinprasadhs added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 1, 2024