System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
TensorFlow installed from (source or binary): Source
TensorFlow version (use command below): r1.11
Python version: N/A
Bazel version (if compiling from source): 0.16.1
GCC/Compiler version (if compiling from source): 7.2.0
CUDA/cuDNN version: N/A
GPU model and memory: N/A
Exact command to reproduce: N/A
Describe the problem
We are seeing intermittent freezes of the Cloud TPU when running VGG19 inference from the Julia frontend via xrt (do note that we're also seeing incorrect answers, which is filed as #22709 and may or may not be related). If the Cloud TPU freezes, the TF_SessionRun call never returns. An aggravating, but non-essential, factor is the concurrent execution of the TPU profiler via capture_tpu_profile. In that situation both the main program and the capture_tpu_profile invocation never exit. If the profiler is running, we usually see only 2-3 successful runs until things start hanging. If I manually disconnect and forcefully kill the session (by severing the socket) and then reconnect, I see either continued freezes or errors of the form:
ERROR: Tensorflow error: Status: Unable to enqueue when not opened, queue: [0000:00:04.0 PE0 C0 MC0 TN0 Queue HBM_WRITE]. State is: CLOSED
[[{{node XRTAllocate}} = XRTAllocate[_device="/job:tpu_worker/replica:0/task:0/device:TPU:0"](XRTAllocate/Const_G1)]]
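Since the symptom is a blocking call that never returns, one generic way to detect the hang (rather than waiting forever) is to issue the call on a worker thread with a watchdog timeout. This is only an illustrative sketch, not part of the report: `blocking_call` stands in for whatever wraps the real TF_SessionRun invocation, and the helper name is hypothetical.

```python
import threading

def run_with_watchdog(blocking_call, timeout_s):
    """Run `blocking_call` on a worker thread; raise if it does not
    finish within `timeout_s` seconds (a stand-in for a hung
    TF_SessionRun -- the stuck thread itself cannot be killed, but
    the caller can then sever the socket and start recovery)."""
    result = {}

    def worker():
        result["value"] = blocking_call()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        raise TimeoutError("session run did not return within %ss" % timeout_s)
    return result["value"]
```

The daemon flag ensures a permanently stuck worker does not keep the process alive at exit.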
My standard protocol to recover from this has been the following:
1. Reconnect the session
2. Run the ShutdownDistributedTPU op (remote session closes)
3. Reconnect the session
4. Run the ShutdownDistributedTPU op again (causes a non-fatal error, but the next step hangs if not done)
5. Run the ConfigureDistributedTPU op (works)
Afterwards I can usually use the TPU again.
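The recovery sequence above can be sketched as follows. This is a hypothetical illustration, not a real API: `connect` and `run_op` stand in for whatever session plumbing the frontend uses; only the op names (ShutdownDistributedTPU, ConfigureDistributedTPU) come from the report.

```python
def recover_tpu(connect, run_op):
    """Sketch of the manual TPU recovery protocol.

    `connect()` returns a fresh session; `run_op(sess, name)` runs the
    named TPU op and may raise on the expected non-fatal error. Both
    are illustrative stand-ins, not TensorFlow APIs.
    """
    sess = connect()                         # 1. reconnect the session
    run_op(sess, "ShutdownDistributedTPU")   # 2. remote session closes
    sess = connect()                         # 3. reconnect again
    try:
        # 4. expected to raise a non-fatal error, but the next step
        #    hangs if this shutdown is skipped
        run_op(sess, "ShutdownDistributedTPU")
    except RuntimeError:
        pass
    run_op(sess, "ConfigureDistributedTPU")  # 5. works; TPU usable again
    return sess
```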
@Keno, sorry for the late response. We are checking to see if you still need help on this issue.
You are using an old version of TensorFlow (1.x), which is no longer supported. We recommend upgrading to 2.10.0; please let us know if the issue still persists in the newer version. Thank you.
Source code / logs
XLA dump (batch size N=1, but the same happens for at least N=10 and N=100): https://gist.github.com/Keno/b54e4be146096daf4d464c1319639404