
[Cloud TPU] Intermittent freezes requiring reset of the TPU #22710

Closed
Keno opened this issue Oct 4, 2018 · 2 comments
Assignees
Labels
comp:tpus (tpu, tpuestimator), stat:awaiting response (Status - Awaiting response from author)

Comments

@Keno
Contributor
Keno commented Oct 4, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): r1.11
  • Python version: N/A
  • Bazel version (if compiling from source): 0.16.1
  • GCC/Compiler version (if compiling from source): 7.2.0
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce: N/A

Describe the problem

We are seeing intermittent freezes of the Cloud TPU when running VGG19 inference from the Julia frontend via xrt (note that we are also seeing incorrect answers, filed as #22709, which may or may not be related). When the Cloud TPU freezes, the TF_SessionRun call never returns. An aggravating, but non-essential, factor is the concurrent execution of the TPU profiler via capture_tpu_profile; in that situation neither the main program nor the capture_tpu_profile invocation ever exits. With the profiler running, we usually see only 2-3 successful runs before things start hanging. If I manually disconnect, forcefully kill the session (by severing the socket), and then reconnect, I see either continued freezes or errors of the form

ERROR: Tensorflow error: Status: Unable to enqueue when not opened, queue: [0000:00:04.0 PE0 C0 MC0 TN0 Queue HBM_WRITE]. State is: CLOSED
	 [[{{node XRTAllocate}} = XRTAllocate[_device="/job:tpu_worker/replica:0/task:0/device:TPU:0"](XRTAllocate/Const_G1)]]
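Since a stuck TF_SessionRun blocks forever, detecting the freeze requires running the call on a worker thread and bounding the wait. The sketch below is a generic watchdog pattern, not part of the reported setup; `fn` stands in for whatever blocking call (e.g. a session run) is being monitored:

```python
import threading

def run_with_timeout(fn, timeout_s):
    """Run fn on a worker thread; report whether it returned in time.

    Returns (finished, result). If finished is False the call is
    presumed hung. The worker thread is left running as a daemon,
    since a truly stuck blocking call cannot be cancelled from here.
    """
    result = {}

    def worker():
        result["value"] = fn()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return False, None
    return True, result.get("value")
```

When the watchdog trips, the caller can proceed to the recovery protocol described below instead of waiting indefinitely.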

My standard protocol to recover from this has been the following:

  • Reconnect the session
  • Run the ShutdownDistributedTPU op (Remote session closes)
  • Reconnect the session
  • Run the ShutdownDistributedTPU op (Causes a non-fatal error, but the next step hangs if not done)
  • Run the ConfigureDistributedTPU op (works)

Afterwards, I can usually use the TPU again.
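The recovery sequence above can be sketched as a small driver loop. The three callables (`reconnect`, `run_shutdown`, `run_configure`) are hypothetical placeholders for the actual TF_SessionRun invocations of the ShutdownDistributedTPU and ConfigureDistributedTPU ops; the point is only the ordering and error tolerance the protocol requires:

```python
def recover_tpu(reconnect, run_shutdown, run_configure):
    """Replay the manual recovery protocol from the report.

    reconnect():     open a fresh session to the TPU worker.
    run_shutdown():  run the ShutdownDistributedTPU op; may raise
                     (first call drops the remote session, second
                     call errors non-fatally but must still happen).
    run_configure(): run the ConfigureDistributedTPU op.
    """
    reconnect()
    try:
        run_shutdown()      # remote session closes here
    except RuntimeError:
        pass                # tolerate the dropped session
    reconnect()
    try:
        run_shutdown()      # non-fatal error, but skipping it
    except RuntimeError:    # makes the next step hang
        pass
    run_configure()         # reconfigure; TPU is usable again
```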

Source code / logs

XLA dump (batch size N=1, but the same happens for at least N=10 and N=100): https://gist.github.com/Keno/b54e4be146096daf4d464c1319639404

@ymodak ymodak assigned saeta and unassigned ymodak Oct 15, 2018
@saeta saeta assigned tatatodd and fdxmw and unassigned saeta Oct 15, 2018
@tilakrayal tilakrayal added the comp:tpus tpu, tpuestimator label Jun 3, 2022
@tilakrayal tilakrayal assigned tilakrayal and unassigned tilakrayal Jun 3, 2022
@tilakrayal tilakrayal added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 6, 2022
@chunduriv chunduriv self-assigned this Sep 24, 2022
@chunduriv
Contributor

@Keno, sorry for the late response. We are checking to see whether you still need help with this issue.

You are using an old version of TensorFlow (1.x), which is no longer supported. We recommend that you upgrade to 2.10.0 and let us know if the issue still persists in the newer version. Thank you.

@chunduriv chunduriv added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 24, 2022
@Keno
Contributor Author
Keno commented Sep 24, 2022

This code path is obsolete - I don't think anyone is using it.

@Keno Keno closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 24, 2022

7 participants