
[Cloud TPU] Intermittent freezes requiring reset of the TPU #22710

Closed
Keno opened this issue Oct 4, 2018 · 2 comments
Assignees
Labels
comp:tpus (tpu, tpuestimator), stat:awaiting response (Status - Awaiting response from author)

Comments

@Keno
Contributor
Keno commented Oct 4, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): r1.11
  • Python version: N/A
  • Bazel version (if compiling from source): 0.16.1
  • GCC/Compiler version (if compiling from source): 7.2.0
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce: N/A

Describe the problem

We are seeing intermittent freezes of the Cloud TPU when running VGG19 inference from the Julia frontend via xrt (note that we are also seeing incorrect answers, filed as #22709, which may or may not be related). When the Cloud TPU freezes, the TF_SessionRun call never returns. An aggravating, but non-essential, factor is the concurrent execution of the TPU profiler via capture_tpu_profile; in that situation neither the main program nor the capture_tpu_profile invocation ever exits. With the profiler running, we usually see only 2-3 successful runs before things start hanging. If I manually disconnect, forcefully kill the session (by severing the socket), and then reconnect, I see either continued freezes or errors of the form

ERROR: Tensorflow error: Status: Unable to enqueue when not opened, queue: [0000:00:04.0 PE0 C0 MC0 TN0 Queue HBM_WRITE]. State is: CLOSED
	 [[{{node XRTAllocate}} = XRTAllocate[_device="/job:tpu_worker/replica:0/task:0/device:TPU:0"](XRTAllocate/Const_G1)]]
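Since a stuck TF_SessionRun blocks forever, detecting the freeze requires running the call on a worker thread and bounding the wait. The sketch below is a generic watchdog pattern, not part of the reported setup; `fn` stands in for whatever blocking call (e.g. a session run) is being monitored:

```python
import threading

def run_with_timeout(fn, timeout_s):
    """Run fn on a worker thread; report whether it returned in time.

    Returns (finished, result). If finished is False the call is
    presumed hung. The worker thread is left running as a daemon,
    since a truly stuck blocking call cannot be cancelled from here.
    """
    result = {}

    def worker():
        result["value"] = fn()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return False, None
    return True, result.get("value")
```

When the watchdog trips, the caller can proceed to the recovery protocol described below instead of waiting indefinitely.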

My standard protocol to recover from this has been the following:

  • Reconnect the session
  • Run the ShutdownDistributedTPU op (Remote session closes)
  • Reconnect the session
  • Run the ShutdownDistributedTPU op (Causes a non-fatal error, but the next step hangs if not done)
  • Run the ConfigureDistributedTPU op (works)

Afterwards, I can usually use the TPU again.
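The recovery sequence above can be sketched as a small driver loop. The three callables (`reconnect`, `run_shutdown`, `run_configure`) are hypothetical placeholders for the actual TF_SessionRun invocations of the ShutdownDistributedTPU and ConfigureDistributedTPU ops; the point is only the ordering and error tolerance the protocol requires:

```python
def recover_tpu(reconnect, run_shutdown, run_configure):
    """Replay the manual recovery protocol from the report.

    reconnect():     open a fresh session to the TPU worker.
    run_shutdown():  run the ShutdownDistributedTPU op; may raise
                     (first call drops the remote session, second
                     call errors non-fatally but must still happen).
    run_configure(): run the ConfigureDistributedTPU op.
    """
    reconnect()
    try:
        run_shutdown()      # remote session closes here
    except RuntimeError:
        pass                # tolerate the dropped session
    reconnect()
    try:
        run_shutdown()      # non-fatal error, but skipping it
    except RuntimeError:    # makes the next step hang
        pass
    run_configure()         # reconfigure; TPU is usable again
```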

Source code / logs

XLA dump (batch size N=1, but the same happens for at least N=10 and N=100): https://gist.github.com/Keno/b54e4be146096daf4d464c1319639404

@ymodak ymodak assigned saeta and unassigned ymodak Oct 15, 2018
@saeta saeta assigned tatatodd and fdxmw and unassigned saeta Oct 15, 2018
@tilakrayal tilakrayal added the comp:tpus tpu, tpuestimator label Jun 3, 2022
@tilakrayal tilakrayal assigned tilakrayal and unassigned tilakrayal Jun 3, 2022
@tilakrayal tilakrayal added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 6, 2022
@chunduriv chunduriv self-assigned this Sep 24, 2022
@chunduriv
Contributor

@Keno, sorry for the late response. We are checking to see whether you still need help with this issue.

You are using an old version of TensorFlow (1.x), which is no longer supported. We recommend that you upgrade to 2.10.0 and let us know if the issue still persists in the newer version. Thank you.

@chunduriv chunduriv added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 24, 2022
@Keno
Contributor Author
Keno commented Sep 24, 2022

This code path is obsolete - I don't think anyone is using it.

@Keno Keno closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 24, 2022

7 participants