Enable SO_REUSEPORT option in tensorflow training server #35383
Comments
The problem is more important now with the introduction of new distribution strategies like MultiWorkerMirroredStrategy. Before, with the PS strategy, we were able to start the TF server on our own and then just inject the 'google' environment in here to prevent the server from being started again (roughly as in the sketch below). This narrowed the race window around port reservation, so we never really saw the problem. With MultiWorkerMirroredStrategy we can no longer start the server upfront; we have to start the whole strategy, which then starts the server, leaving a longer delay between reserving the port and the port actually being bound.
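For context, a minimal TF 1.x-style sketch of that workflow; the hostnames, ports, and task indices are made up, and the "environment": "google" entry is the trick referred to above that keeps the Estimator machinery from starting a second server:

```python
import json
import os

import tensorflow as tf  # TF 1.x-style APIs

# Hypothetical cluster layout; with tf-yarn this would be assembled from
# the ports each container managed to reserve.
cluster = {
    "chief": ["chief-host:2222"],
    "worker": ["worker-host:2222"],
    "ps": ["ps-host:2222"],
}

# Start the gRPC server ourselves, before handing control to the
# Estimator / PS strategy.
server = tf.train.Server(
    tf.train.ClusterSpec(cluster),
    job_name="worker",
    task_index=0,
    start=True)

# Export TF_CONFIG for the rest of the stack; "environment": "google"
# prevents a second server from being started for this task.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
    "environment": "google",
})
```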
I am curious why you want to start the servers upfront. Why not let the distribution strategies create standard TF servers?
Not being able to reserve a port is really a YARN issue, not a TF one. Currently the TF server does not support clean shutdown, so as a workaround I'd suggest using a dummy server to probe for available ports instead of creating and destroying TF servers repeatedly.
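The dummy-server workaround is just plain sockets, nothing TF-specific; a sketch of how it is typically done:

```python
import socket

def probe_free_port():
    """Ask the OS for a free port by binding a throwaway socket to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("", 0))
    return sock, sock.getsockname()[1]

sock, port = probe_free_port()
# ...exchange `port` with the other tasks to build TF_CONFIG...
# The port is only held while `sock` stays open; without SO_REUSEPORT it
# has to be closed before the TF server can bind it, which is exactly the
# race discussed in the next comments.
sock.close()
```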
We currently use the solution described by @byronyi, but we hit race conditions with it. Because the SO_REUSEPORT option is not set, we have to shut down the dummy server before starting the TF server. Since YARN cannot reserve ports, another rogue process can grab the port before the TensorFlow server does. For this reason, we would like to enable this option to be sure the port stays reserved for TensorFlow.
@jdlesage I see what you mean by the race condition. Unfortunately, using SO_REUSEPORT does not solve this issue. Suppose two independent TF jobs bind to the same port with SO_REUSEPORT on the same machine: each thinks it has reserved the port, but since both use SO_REUSEPORT, neither bind fails. You have to start the servers upfront and propagate the cluster spec after the servers are initialized.
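One way this has been done in graph-mode TF is "ClusterSpec propagation": each server starts knowing only itself, and the client session later supplies the full cluster through ConfigProto.cluster_def. A minimal sketch, assuming TF 1.x / tf.compat.v1 APIs and made-up addresses:

```python
import tensorflow.compat.v1 as tf

# Step 1: on each task, start a server that initially only knows about
# itself ("host-a:2222" stands for this task's own address).
server = tf.train.Server(
    tf.train.ClusterSpec({"worker": ["host-a:2222"]}),
    job_name="worker",
    task_index=0)

# Step 2: once every task's address has been collected, the client
# propagates the full cluster via ConfigProto.cluster_def instead of
# restarting any server.
full_cluster = tf.train.ClusterSpec(
    {"worker": ["host-a:2222", "host-b:2222"]})
config = tf.ConfigProto(cluster_def=full_cluster.as_cluster_def())

with tf.Session(server.target, config=config) as sess:
    print(sess.run(tf.constant("cluster spec propagated")))
```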
I was not aware that it is possible to propagate the cluster spec after the servers are initialized. Definitely, that's the best solution. Do you have some pointers that describe how to propagate the cluster spec?
Passing a ClusterResolver instead of TF_CONFIG could be helpful as well.
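For the TF 2.x path, a sketch with placeholder addresses; note that the cluster_resolver argument on MultiWorkerMirroredStrategy only exists in more recent 2.x releases, not necessarily the version discussed in this thread:

```python
import tensorflow as tf

# Cluster layout discovered out of band (e.g. from the resource manager),
# instead of exporting TF_CONFIG. Addresses are placeholders.
cluster_spec = tf.train.ClusterSpec(
    {"worker": ["host-a:2222", "host-b:2222"]})

resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="worker", task_id=0, rpc_layer="grpc")

# Recent TF 2.x releases accept the resolver directly, so no TF_CONFIG
# environment variable is needed.
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    cluster_resolver=resolver)
```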
Thanks for the replies. @byronyi The code you linked to is nice, and we actually use it for low-level TensorFlow in TF 1.15, but it doesn't seem to work anymore with TF 2 (at least all of the methods show up as deprecated).
I don't necessarily agree with this. We can have two TF jobs starting at the same time, yes, but the initial port assignment is random, so the probability of a collision is really small. Also, in that case we could even write a port assignment service of our own and only hand out free ports. So, imo, we would need two fixes:
This is not supported for
System information
Describe the feature and the current behavior/state.
Add the SO_REUSEPORT option when starting the TensorFlow training server. It would make it possible to scan ports in order to build the TF_CONFIG environment variable. This is necessary to use distributed TensorFlow with resource managers that do not reserve ports (such as YARN).
This has already been discussed in ticket #21492. It is unclear why the option was disabled in 8cf38e8.
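To illustrate the intent with a plain socket (Linux only, since SO_REUSEPORT is not available everywhere): if both the probing socket and the TF server set SO_REUSEPORT, the probe would not have to be closed before the server binds the same port, which removes the window in which another process can steal it.

```python
import socket

# Probe for a free port while keeping it held. SO_REUSEPORT means a
# second socket (here, the TF server, if it also set the option) could
# bind the same port while this one is still open.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
probe.bind(("", 0))
port = probe.getsockname()[1]
print("reserved port:", port)  # advertise this port in TF_CONFIG
# Keep `probe` open until the TF server has bound the port.
```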
Will this change the current API? How?
No
Who will benefit from this feature?
Projects like https://github.com/criteo/tf-yarn (TensorFlow on YARN) will use it to implement the recommended way of creating the cluster configuration (from https://www.tensorflow.org/guide/distributed_training). The procedure will be:
Any Other info.