Public API to preempt tf.train.Server #21492
Comments
@xiejw @ispirmustafa -- what is the recommended way to handle pre-empted Servers for systems like YARN?
That is not the right way to understand the sleep logic in the code.
@xiejw thank you for the clarification, and apologies for poor wording on my side. The code fragment I've linked to gives the "delay starting" rationale in the comment.
Would it be right to say that in order for a task to start, it should successfully connect to all other tasks; and that if the connection is unsuccessful, the task will retry indefinitely/"long enough"? I would very much appreciate it if you could reference relevant code in the reply.
Assume device_filter is not set. The precise sequence can be described as follows:

1. The task starts its tf.train.Server, which opens its port and starts listening.
2. The task sleeps for its start delay (the time.sleep under discussion).
3. The task creates a session and starts training, which connects to the other tasks.

Say there is a worker stuck in the time.sleep for 100000 secs (basically forever). Since its port was already opened in step 1, the other workers can start training (step 3) without waiting for it. With a device filter, only step 3 is slightly different: each worker only tries to find the devices in the filter list rather than all devices.
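A minimal sketch of the sequence described above, assuming a made-up two-task cluster (the host names are placeholders): constructing the tf.train.Server is what opens the port, so peers can connect even while the task itself sleeps.

```python
import time
import tensorflow as tf

# Placeholder cluster layout, purely for illustration.
cluster = tf.train.ClusterSpec({
    "chief": ["chief-host:2222"],
    "worker": ["worker-host:2222"],
})

# Step 1: constructing the server binds worker-host:2222 and starts listening
# right away, before any training code runs.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Even if this task now sleeps "forever", the other tasks can already reach its
# port and proceed to step 3 (creating their sessions and training).
time.sleep(100000)
```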
Thank you very much for the detailed reply. Let me do some experiments and get back to you.
This is indeed a non-issue, thanks again @xiejw!
@xiejw I thought about this more over a weekend and came up with another use-case for preemption: port reservation. YARN does not manage the network on the containers, so in order to run TF on YARN one has to manually reserve a port on each of the allocated containers and communicate it to all other tasks to assemble a cluster spec.
@karmel do we have any expert for YARN? I do not understand that environment, so it is very hard for me to give the best suggestions here. @superbobry As I mentioned above, I do not understand YARN, so my suggestions could be wrong or sub-optimal. The minimal knowledge each training job needs is its own address (host and port) and its own task type and index in the cluster spec.
The ports of the other workers are optional: you still need to fill in TF_CONFIG, but the values for the other workers do not matter. If the conditions above are met in YARN, then you can construct a device_filter (see example code; a sketch follows below). The session will then only try to find the devices in that list, even if the other workers are not yet online (or even if the other workers' addresses are fake).
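The example code referred to above is not reproduced in this thread; a rough sketch of the idea, with placeholder addresses and a chief/worker layout assumed, might look like this (entries for workers outside the device filter do not need to be real):

```python
import json
import os
import tensorflow as tf

# Placeholder TF_CONFIG: the addresses of tasks outside the device filter can
# be wrong or unreachable, as long as this task's own entry ("worker", index 0)
# and the chief's entry are correct.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief-host:2222"],
        "worker": ["worker-host-0:2222", "worker-host-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# Restrict the session to the devices this task actually needs, so it does not
# block on workers that are offline or have placeholder addresses.
session_config = tf.ConfigProto(
    device_filters=["/job:chief", "/job:worker/task:0"])
run_config = tf.estimator.RunConfig(session_config=session_config)
```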
@xiejw the problem with YARN is that it does not manage the network, i.e. there is no way to reserve a port prior to submitting an application. One possible solution to this is to reserve an ephemeral port on each container at startup, broadcast it to the other tasks to assemble the cluster spec, and only then run training, which roughly translates to

```python
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.bind(("", 0))
    ipaddr, port = sock.getsockname()

cluster_spec = broadcast_addr_and_aggregate_spec(f"{ipaddr}:{port}")
# export TF_CONFIG
tf.estimator.train_and_evaluate(...)
```

Note that this implementation has a race condition between closing the socket and the tf.train.Server created inside train_and_evaluate binding the same port. I see multiple potential ways of getting rid of the race condition by slightly modifying the existing Python/C++ API of tf.train.Server.
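For reference, a hedged sketch of the kind of workaround under discussion; variable names like cluster_spec, task_type, task_id, estimator, train_spec and eval_spec are assumed to come from the reservation/broadcast step above, and this is not an officially supported pattern:

```python
import tensorflow as tf

# Pre-start the gRPC server on the reserved address so this process keeps
# ownership of the port. Whether train_and_evaluate() then detects and reuses
# an already-running server (instead of trying to bind the port again) is
# exactly the undocumented behaviour this issue asks to make public.
server = tf.train.Server(cluster_spec, job_name=task_type, task_index=task_id)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```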
Most TensorFlow programs contain a single server, but this addresses test flakiness where we create multiple servers in the same process. Change: 145293417
@mrry could you comment on the flakiness caused by SO_REUSEPORT?
It is not needed (and in fact never was), see tensorflow/tensorflow#21492 as it does not remove the race condition, and in the general case does not make it any less probable.
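For context, a rough sketch of the SO_REUSEPORT-based reservation being discussed (an assumption about the intended mechanism, not code from this thread; SO_REUSEPORT is Linux-specific):

```python
import socket

# Reserve an ephemeral port with SO_REUSEPORT and keep the socket open, the
# idea being that a second SO_REUSEPORT socket (the server's) could later bind
# the same port. As the quote above notes, this does not remove the race
# condition, and in the general case does not make it any less probable.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
sock.bind(("", 0))
_, port = sock.getsockname()
sock.listen(1)  # keep the reserving socket alive until the real server is up
```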
@jhseu -- I hear you might be able to advise on YARN?
I don't have much personal experience with running on YARN, but the LinkedIn folks do:
Thanks @jhseu, I had a quick look at the code, and I think they have exactly the same issue. I'd like to point out that nothing above is specific to YARN, nor does it require any familiarity with the Hadoop ecosystem. The gist of the issue is that train_and_evaluate does not provide a public way to start (or account for an already started) tf.train.Server.
Nagging Assignee @karmel: It has been 47 days with no activity and this issue has an assignee. Please update the label and/or status accordingly. |
DistributionStrategies is the new API for handling distribution and synchronization, and that API will be preferred in the coming months to the tf.train.Server API. That said, we have now started a Special Interest Group (SIG) for networking issues like these, so while I am closing this issue as it pertains to tf.train.Server, I would encourage you to join and participate in the Networking SIG to ensure these questions are addressed for YARN and other systems.
Hi. We have faced exactly the same problem when training TensorFlow on YARN. Have you made any progress on it?
@karmel as far as I can tell the old distributed TF corresponds to
System information
Describe the problem
tf.train_and_evaluate uses time.sleep to (optimistically) synchronize the startup of chief/worker nodes in the distributed mode (see estimator/training.py). The implicit assumption in this logic is that by the time the worker is spawned, the chief has already started its tf.train.Server, i.e. the scheduler should be aware of the assumption and should schedule and initialize the chief first. This might not be easily achievable on general-purpose systems like YARN.

One possible solution to this is to synchronize the chief/worker tasks on a barrier, and then preemptively start the server right after the barrier, but prior to calling train_and_evaluate. This does not eliminate the race condition entirely, but makes it much less likely in practice (a sketch of this flow follows below). The only problem here is that train_and_evaluate does not provide a documented way to account for preempted servers. The undocumented way is:

tensorflow/tensorflow/python/estimator/training.py
Lines 747 to 748 in 2b4fd1c

I was wondering if it would be possible to make this part of the train_and_evaluate contract public or, alternatively, address the issue in another way?
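A rough sketch of the barrier-plus-preemptive-start flow proposed above. This is hedged: wait_on_external_barrier is a hypothetical helper backed by whatever coordination service the launcher provides (e.g. ZooKeeper), and cluster_spec, task_type, task_id, estimator, train_spec and eval_spec are assumed to be built from TF_CONFIG by the caller.

```python
import tensorflow as tf

def run_task(cluster_spec, task_type, task_id, estimator, train_spec, eval_spec):
    # 1. Rendezvous with all other tasks, so every task has been scheduled and
    #    has reserved its port before anyone starts dialling peers.
    wait_on_external_barrier()  # hypothetical helper, not part of TensorFlow

    # 2. Preemptively start the gRPC server right after the barrier, so the
    #    reserved port is owned and listening.
    tf.train.Server(cluster_spec, job_name=task_type, task_index=task_id)

    # 3. Only then enter train_and_evaluate(), which otherwise relies on
    #    time.sleep() and on the chief having been started first.
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```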