Is there a way to sync the journal of data service with checkpoint? #63079
Labels
2.6.0
comp:data
stat:awaiting tensorflower
type:support
Issue type
Support
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
tf 2.6
Custom code
No
OS platform and distribution
No response
Mobile device
No response
Python version
No response
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
In my app, I run the tf.data service with fault_tolerant_mode set to True, and save_checkpoint_steps is 5000 in tf.train.MonitoredTrainingSession.
For example, if the app is killed at global step 9500 and I resume it from the checkpoint at step 5000, the data service continues serving data from where it left off at step 9500, so the data for about 4500 steps is skipped.
Is there a way to sync the data service journal with the checkpoint, or at least to reduce how much data is skipped?
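To make the size of the gap concrete, here is a minimal sketch of the arithmetic only. It assumes checkpoints land exactly on multiples of save_checkpoint_steps and that the data service never rewinds on restart. The helper names `last_checkpoint_step` and `skipped_steps` are hypothetical and are not TensorFlow APIs.

```python
def last_checkpoint_step(kill_step: int, save_checkpoint_steps: int) -> int:
    """Most recent checkpointed step at or before kill_step.

    Assumption: checkpoints are written exactly at multiples of
    save_checkpoint_steps (real checkpoint timing may differ slightly).
    """
    if save_checkpoint_steps <= 0:
        raise ValueError("save_checkpoint_steps must be positive")
    if kill_step < 0:
        raise ValueError("kill_step must be non-negative")
    return (kill_step // save_checkpoint_steps) * save_checkpoint_steps


def skipped_steps(kill_step: int, save_checkpoint_steps: int) -> int:
    """Steps of data consumed after the last checkpoint and not replayed.

    Assumption: on resume the model restarts from the last checkpoint,
    while the data service keeps serving from where it stopped.
    """
    return kill_step - last_checkpoint_step(kill_step, save_checkpoint_steps)


if __name__ == "__main__":
    # The case described above: killed at step 9500, checkpoint every 5000 steps.
    print(skipped_steps(9500, 5000))
    # The same kill point with a hypothetical smaller checkpoint interval.
    print(skipped_steps(9499, 500))
```

Under these assumptions the skip is always smaller than save_checkpoint_steps, so a smaller interval would narrow the gap at the cost of more frequent checkpoint writes. It would not actually sync the journal with the checkpoint.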
Standalone code to reproduce the issue
Relevant log output
No response