
Enable SO_REUSEPORT option in tensorflow training server #35383

Open
jdlesage opened this issue Dec 24, 2019 · 10 comments
Labels
comp:dist-strat Distribution Strategy related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower type:feature Feature requests

Comments

@jdlesage
Contributor

System information

  • TensorFlow version (you are using): 1.15 and >=2.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
Add the SO_REUSEPORT option when starting the TensorFlow training server. It would make it possible to scan ports and build the TF_CONFIG environment variable while the port stays reserved. This is necessary to use distributed TensorFlow with resource managers that do not reserve ports (such as YARN).

This has already been discussed in ticket #21492. It is unclear why the option was disabled in 8cf38e8.

Will this change the current API? How?
No

Who will benefit with this feature?

Projects like https://github.com/criteo/tf-yarn (TensorFlow on YARN) will use it to implement the recommended way of creating the cluster configuration (from https://www.tensorflow.org/guide/distributed_training). The procedure will be (a rough sketch follows the list):

  • Launch on every executor a process that scans ports and reserves a free one.
  • A master gathers the port numbers.
  • The master builds the configuration and broadcasts the TF_CONFIG variable to all executors.
  • Launch the TensorFlow servers.
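A minimal sketch of that procedure (the helper names reserve_free_port and build_tf_config are illustrative, not an existing API; the gather/broadcast step between executors is elided):

    import json
    import os
    import socket

    def reserve_free_port():
        # Bind to port 0 so the OS picks a free port; keep the socket open to
        # hold the reservation. If the TF server also used SO_REUSEPORT (what
        # this issue asks for), this socket would not have to be closed first.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if hasattr(socket, "SO_REUSEPORT"):  # not available on every platform
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("", 0))
        return sock, sock.getsockname()[1]

    def build_tf_config(workers, task_type, task_id):
        # The master gathers host:port pairs and broadcasts them; each executor
        # then exports the result as TF_CONFIG before starting its server.
        return json.dumps({
            "cluster": {"worker": workers},
            "task": {"type": task_type, "index": task_id},
        })

    sock, port = reserve_free_port()
    # ... send `port` to the master, receive the full worker list back ...
    os.environ["TF_CONFIG"] = build_tf_config(
        ["host-a:12345", "host-b:23456"], task_type="worker", task_id=0)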

Any Other info.

@oanush oanush self-assigned this Dec 26, 2019
@oanush oanush added comp:apis Highlevel API related issues TF 1.15 for issues seen on TF 1.15 TF 2.0 Issues relating to TensorFlow 2.0 type:feature Feature requests labels Dec 26, 2019
@oanush oanush assigned ymodak and unassigned oanush Dec 26, 2019
@fhoering
fhoering commented Jan 2, 2020

The problem is more important now with the introduction of new distribution strategies like MultiWorkerMirroredStrategy.

Before, with the PS strategy, we were able to start the TF server on our own:

    import tensorflow as tf  # TF 1.x API

    server = tf.train.Server(
        tf.train.ClusterSpec(cluster_spec),
        job_name=task_type,
        task_index=task_id,
        config=session_config,
        start=True)

and then just injecting the 'google' environment here to prevent the server from being started again:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/distribute_coordinator.py#L432

This kept the window of the port-reservation race condition small, and therefore we never really saw the problem.
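Roughly, the whole pattern looked like this (a sketch, not our exact code; as described above, the "environment": "google" field in TF_CONFIG is what the coordinator checks before starting another std server):

    import json
    import os
    import tensorflow as tf

    cluster_spec = {"worker": ["host-a:2222", "host-b:2222"]}
    task_type, task_id = "worker", 0

    # Start the std server ourselves, as soon as possible after reserving the port.
    server = tf.train.Server(  # tf.distribute.Server in TF 2.x
        tf.train.ClusterSpec(cluster_spec),
        job_name=task_type,
        task_index=task_id,
        start=True)

    # Mark the environment as "google" so the distribute coordinator does not
    # try to start a second server on the same port.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": cluster_spec,
        "task": {"type": task_type, "index": task_id},
        "environment": "google",
    })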

Now, with MultiWorkerMirroredStrategy, we can't start the server upfront anymore; we need to start the whole strategy, which then starts the server, which leaves a longer delay between the port being reserved and the port actually being taken.
(See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/collective_all_reduce_strategy.py#L278)

@superbobry

@yuefengz
Contributor
yuefengz commented Jan 6, 2020

I am curious why you want to start the servers upfront. Why not let the distribution strategies create the std TF servers?

@byronyi
Contributor
byronyi commented Jan 6, 2020

Not being able to reserve a port is really a Yarn issue, not a TF one.

Currently the TF server does not support clean shutdown, and as a workaround I'd suggest using a dummy server to probe available ports instead of creating and destroying TF servers repeatedly.
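Something along these lines (a minimal sketch; probe_free_port is an illustrative helper, not an existing TF API):

    import socket

    def probe_free_port(host=""):
        # A plain listening socket acts as the throw-away "dummy server".
        # Binding to port 0 lets the OS pick a free port.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))
            sock.listen(1)
            return sock.getsockname()[1]  # socket closes when the block exits

    port = probe_free_port()
    # The dummy socket is closed here; the real TF server has to bind `port`
    # afterwards, and another process may grab it in between.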

@jdlesage
Contributor Author
jdlesage commented Jan 6, 2020

We currently use the solution described by @byronyi, but we encounter some race conditions with it. Because the SO_REUSEPORT option is not set, we have to shut down the dummy server before starting the TF server. As YARN cannot reserve ports, some other rogue process can take the port before the TensorFlow server does. For this reason, we would like to activate this option to be sure the port stays reserved for TensorFlow.

@byronyi
Contributor
byronyi commented Jan 6, 2020

@jdlesage I see what you mean by the race condition. Unfortunately, using SO_REUSEPORT does not solve this issue. Suppose two independent TF jobs bind to the same port with SO_REUSEPORT on the same machine. Each thinks it has reserved the port, but since both of them are using SO_REUSEPORT, neither of them sees the bind fail.

You have to start the servers upfront and propagate the cluster spec after the server initialization.
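A minimal Linux-only illustration of that failure mode (the port number is arbitrary; SO_REUSEPORT is not exposed on every platform):

    import socket

    def bind_with_reuseport(port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("127.0.0.1", port))  # does not raise even if the port is taken
        sock.listen(1)
        return sock

    # Both binds succeed, so neither "job" learns that it shares the port.
    first = bind_with_reuseport(34567)
    second = bind_with_reuseport(34567)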

@jdlesage
Contributor Author
jdlesage commented Jan 6, 2020

I was not aware that it is possible to propagate the cluster spec after the servers' initialization. That's definitely the best solution. Do you have some pointers that describe how to propagate the cluster spec?

@byronyi
Contributor
byronyi commented Jan 6, 2020

Take a look at this: #11081

Not sure how this will work with dist-strat though; I will leave the question to @yuefengz.

@byronyi
Contributor
byronyi commented Jan 6, 2020

Passing a ClusterResolver instead of TF_CONFIG could be helpful as well.
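For example, something like the following (a sketch; recent TF 2.x versions accept a cluster_resolver argument on MultiWorkerMirroredStrategy, and the host names here are placeholders):

    import tensorflow as tf

    # SimpleClusterResolver wraps an explicit ClusterSpec; TFConfigClusterResolver
    # would read TF_CONFIG from the environment instead.
    resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
        tf.train.ClusterSpec({"worker": ["host-a:2222", "host-b:2222"]}),
        task_type="worker",
        task_id=0,
        rpc_layer="grpc")

    # The strategy consumes the resolver instead of parsing the TF_CONFIG
    # environment variable.
    strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)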

@fhoering
fhoering commented Jan 6, 2020

Thanks for the replies.

@byronyi
I had a look at using a ClusterResolver instead of TF_CONFIG, but it doesn't seem to solve this problem. From what I understood, all it does is hand the spec in some form to the distribution strategies, which then start the server again, so I suppose the race condition would be the same.

The code you linked to is nice, and we actually use it for low-level TensorFlow in TF 1.15, but it doesn't seem to work anymore with TF 2 (at least all of the methods are shown as deprecated).

> I see what you mean by the race condition. Unfortunately, using SO_REUSEPORT does not solve this issue. Suppose two independent TF jobs bind to the same port with SO_REUSEPORT on the same machine. Each thinks it has reserved the port, but since both of them are using SO_REUSEPORT, neither of them sees the bind fail.

I don't necessarily agree with this. Two TF jobs can indeed start at the same time, but the initial port assignment is random, so the probability of a collision is really small. In that case we could even write a port-assignment service of our own and only hand out free ports.
With the current TF behavior we are blocked, because we need to free the ports every time (before starting the TF server) and then we don't even know when TF binds them again.
With the PS strategy this timeframe is really small, because we can start the server upfront, so the race condition exists but is not a real issue.
With the CollectiveAllReduce strategy (aka MultiWorkerMirroredStrategy) it is an issue, because it no longer seems possible to start the server upfront (@yuefengz, can you confirm this statement is true?).

So, in my opinion, we would need two fixes:

  • re-activate SO_REUSEPORT;
  • allow starting the TF server upfront in every case.

@yuefengz
Contributor
yuefengz commented Jan 7, 2020

Skipping the creation of the std servers is not supported for MultiWorkerMirroredStrategy; the server is started by the context object. @haoyuz can probably share a workaround, if any.

@ymodak ymodak added comp:dist-strat Distribution Strategy related issues and removed comp:apis Highlevel API related issues labels Jan 7, 2020
@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 1, 2021
@Saduf2019 Saduf2019 assigned Saduf2019 and unassigned Saduf2019 Aug 16, 2021
@tilakrayal tilakrayal removed TF 2.0 Issues relating to TensorFlow 2.0 TF 1.15 for issues seen on TF 1.15 labels Dec 13, 2021