
Enable SO_REUSEPORT option in tensorflow training server #35383

Open
jdlesage opened this issue Dec 24, 2019 · 10 comments
Labels
comp:dist-strat Distribution Strategy related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower type:feature Feature requests

Comments

@jdlesage
Contributor

System information

  • TensorFlow version (you are using): 1.15 and >=2.0
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
Add the SO_REUSEPORT option when starting the TensorFlow training server. It would make it possible to scan ports and build the TF_CONFIG environment variable while the port stays reserved. This is necessary to use distributed TensorFlow with resource managers that do not reserve ports (such as YARN).

This has already been discussed in ticket #21492. It is unclear why the option was disabled in 8cf38e8.

Will this change the current API? How?
No

Who will benefit with this feature?

Projects like https://github.com/criteo/tf-yarn (TensorFlow on YARN) will use it to implement the recommended way of creating the cluster configuration (from https://www.tensorflow.org/guide/distributed_training). The procedure will be (a rough sketch follows the list):

  • Launch on every executor a process that scans ports and reserves a free one.
  • A master gathers the port numbers.
  • The master builds the configuration and broadcasts the TF_CONFIG variable to all executors.
  • Launch the TensorFlow servers.
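A minimal sketch of that procedure (the helper names reserve_free_port and build_tf_config are illustrative, not an existing API; the gather/broadcast step between executors is elided):

    import json
    import os
    import socket

    def reserve_free_port():
        # Bind to port 0 so the OS picks a free port; keep the socket open to
        # hold the reservation. If the TF server also used SO_REUSEPORT (what
        # this issue asks for), this socket would not have to be closed first.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if hasattr(socket, "SO_REUSEPORT"):  # not available on every platform
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("", 0))
        return sock, sock.getsockname()[1]

    def build_tf_config(workers, task_type, task_id):
        # The master gathers host:port pairs and broadcasts them; each executor
        # then exports the result as TF_CONFIG before starting its server.
        return json.dumps({
            "cluster": {"worker": workers},
            "task": {"type": task_type, "index": task_id},
        })

    sock, port = reserve_free_port()
    # ... send `port` to the master, receive the full worker list back ...
    os.environ["TF_CONFIG"] = build_tf_config(
        ["host-a:12345", "host-b:23456"], task_type="worker", task_id=0)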

Any Other info.

@oanush oanush self-assigned this Dec 26, 2019
@oanush oanush added comp:apis Highlevel API related issues TF 1.15 for issues seen on TF 1.15 TF 2.0 Issues relating to TensorFlow 2.0 type:feature Feature requests labels Dec 26, 2019
@oanush oanush assigned ymodak and unassigned oanush Dec 26, 2019
@fhoering
fhoering commented Jan 2, 2020

The problem is more important now with the introduction of new distribution strategies like MultiWorkerMirroredStrategy.

Before, with the PS strategy, we were able to start the TF server on our own:

    import tensorflow as tf  # TF 1.x API

    server = tf.train.Server(
        tf.train.ClusterSpec(cluster_spec),
        job_name=task_type,
        task_index=task_id,
        config=session_config,
        start=True)

and then just injecting the 'google' environment here to prevent the server from being started again:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/distribute_coordinator.py#L432

This kept the window of the port-reservation race condition small, and therefore we never really saw the problem.
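Roughly, the whole pattern looked like this (a sketch, not our exact code; as described above, the "environment": "google" field in TF_CONFIG is what the coordinator checks before starting another std server):

    import json
    import os
    import tensorflow as tf

    cluster_spec = {"worker": ["host-a:2222", "host-b:2222"]}
    task_type, task_id = "worker", 0

    # Start the std server ourselves, as soon as possible after reserving the port.
    server = tf.train.Server(  # tf.distribute.Server in TF 2.x
        tf.train.ClusterSpec(cluster_spec),
        job_name=task_type,
        task_index=task_id,
        start=True)

    # Mark the environment as "google" so the distribute coordinator does not
    # try to start a second server on the same port.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": cluster_spec,
        "task": {"type": task_type, "index": task_id},
        "environment": "google",
    })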

Now, with MultiWorkerMirroredStrategy, we can't start the server upfront anymore; we need to start the whole strategy, which then starts the server, which leaves a longer delay between the port being reserved and the port actually being taken.
(See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/collective_all_reduce_strategy.py#L278)

@superbobry

@yuefengz
Contributor
yuefengz commented Jan 6, 2020

I am curious why you want to start the servers upfront. Why not let the distribution strategies create the std TF servers?

@byronyi
Contributor
byronyi commented Jan 6, 2020

Not being able to reserve a port is really a Yarn issue, not a TF one.

Currently the TF server does not support clean shutdown, and as a workaround I'd suggest using a dummy server to probe available ports instead of creating and destroying TF servers repeatedly.
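Something along these lines (a minimal sketch; probe_free_port is an illustrative helper, not an existing TF API):

    import socket

    def probe_free_port(host=""):
        # A plain listening socket acts as the throw-away "dummy server".
        # Binding to port 0 lets the OS pick a free port.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))
            sock.listen(1)
            return sock.getsockname()[1]  # socket closes when the block exits

    port = probe_free_port()
    # The dummy socket is closed here; the real TF server has to bind `port`
    # afterwards, and another process may grab it in between.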

@jdlesage
Contributor Author
jdlesage commented Jan 6, 2020

We currently use the solution described by @byronyi, but we encounter some race conditions with it. Because the SO_REUSEPORT option is not set, we have to shut down the dummy server before starting the TF server. As YARN cannot reserve ports, some other rogue process can take the port before the TensorFlow server does. For this reason, we would like to activate this option to be sure the port stays reserved for TensorFlow.

@byronyi
Contributor
byronyi commented Jan 6, 2020

@jdlesage I see what you mean by the race condition. Unfortunately, using SO_REUSEPORT does not solve this issue. Suppose two independent TF jobs bind to the same port with SO_REUSEPORT on the same machine. Each thinks it has reserved the port, but since both of them are using SO_REUSEPORT, neither of them sees the bind fail.

You have to start the servers upfront and propagate the cluster spec after the server initialization.
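A minimal Linux-only illustration of that failure mode (the port number is arbitrary; SO_REUSEPORT is not exposed on every platform):

    import socket

    def bind_with_reuseport(port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind(("127.0.0.1", port))  # does not raise even if the port is taken
        sock.listen(1)
        return sock

    # Both binds succeed, so neither "job" learns that it shares the port.
    first = bind_with_reuseport(34567)
    second = bind_with_reuseport(34567)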

@jdlesage
Contributor Author
jdlesage commented Jan 6, 2020

I was not aware that it is possible to propagate the cluster spec after the servers' initialization. That's definitely the best solution. Do you have some pointers that describe how to propagate the cluster spec?

@byronyi
Contributor
byronyi commented Jan 6, 2020

Take a look at this: #11081

Not sure how this will work with dist-strat though; I will leave the question to @yuefengz.

@byronyi
Contributor
byronyi commented Jan 6, 2020

Passing a ClusterResolver instead of TF_CONFIG could be helpful as well.
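For example, something like the following (a sketch; recent TF 2.x versions accept a cluster_resolver argument on MultiWorkerMirroredStrategy, and the host names here are placeholders):

    import tensorflow as tf

    # SimpleClusterResolver wraps an explicit ClusterSpec; TFConfigClusterResolver
    # would read TF_CONFIG from the environment instead.
    resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
        tf.train.ClusterSpec({"worker": ["host-a:2222", "host-b:2222"]}),
        task_type="worker",
        task_id=0,
        rpc_layer="grpc")

    # The strategy consumes the resolver instead of parsing the TF_CONFIG
    # environment variable.
    strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)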

@fhoering
fhoering commented Jan 6, 2020

Thanks for the replies.

@byronyi
I had a look at using a ClusterResolver instead of TF_CONFIG, but it doesn't seem to solve this problem. From what I understood, all it does is hand the spec in some form to the distribution strategies, which then start the server again, so I suppose the race condition would be the same.

The code you linked to is nice, and we actually use it for low-level TensorFlow in TF 1.15, but it doesn't seem to work anymore with TF 2 (at least all of the methods are shown as deprecated).

> I see what you mean by the race condition. Unfortunately, using SO_REUSEPORT does not solve this issue. Suppose two independent TF jobs bind to the same port with SO_REUSEPORT on the same machine. Each thinks it has reserved the port, but since both of them are using SO_REUSEPORT, neither of them sees the bind fail.

I don't necessarily agree with this. Two TF jobs can indeed start at the same time, but the initial port assignment is random, so the probability of a collision is really small. In that case we could even write a port-assignment service of our own and only hand out free ports.
With the current TF behavior we are blocked, because we need to free the ports every time (before starting the TF server) and then we don't even know when TF binds them again.
With the PS strategy this timeframe is really small, because we can start the server upfront, so the race condition exists but is not a real issue.
With the CollectiveAllReduce strategy (aka MultiWorkerMirroredStrategy) it is an issue, because it no longer seems possible to start the server upfront (@yuefengz, can you confirm this statement is true?).

So, in my opinion, we would need two fixes:

  • re-activate SO_REUSEPORT;
  • allow starting the TF server upfront in every case.

@yuefengz
Contributor
yuefengz commented Jan 7, 2020

Skipping the creation of the std servers is not supported for MultiWorkerMirroredStrategy; the server is started by the context object. @haoyuz can probably share a workaround, if any.

@ymodak ymodak added comp:dist-strat Distribution Strategy related issues and removed comp:apis Highlevel API related issues labels Jan 7, 2020
@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 1, 2021
@Saduf2019 Saduf2019 assigned Saduf2019 and unassigned Saduf2019 Aug 16, 2021
@tilakrayal tilakrayal removed TF 2.0 Issues relating to TensorFlow 2.0 TF 1.15 for issues seen on TF 1.15 labels Dec 13, 2021