Describe the feature and the current behavior/state.
Right now we have MirroredStrategy and MultiWorkerMirroredStrategy for synchronous training, where variables and ops are replicated on every replica and all-reduce is used for gradient aggregation. We also have ParameterServerStrategy, where gradient updates from workers are purely asynchronous, since SyncReplicasOptimizer is buggy and has been deprecated.
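For context, a minimal sketch of how the existing all-reduce strategies are used today (assuming TensorFlow 2.x and a toy Keras model; the dataset is a placeholder, not part of any proposal in this issue):

```python
import tensorflow as tf

# Synchronous training: variables are mirrored on each replica and
# gradients are aggregated with all-reduce at every step.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")

# Placeholder data for illustration only.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 4]), tf.random.normal([64, 1]))).batch(8)

model.fit(dataset, epochs=1)
```

The request here is for an analogous synchronous mode where variables live on parameter servers instead of being mirrored on every worker.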
We would like to first collect use cases where synchronous training with MirroredStrategy and MultiWorkerMirroredStrategy is not ideal and synchronous training with parameter servers is necessary.
If this feature is necessary and important enough, we will then use this issue to track the progress of the development of this feature.
We have a separate feature request to support large embeddings with MirroredStrategy and MultiWorkerMirroredStrategy: #27726
Will this change the current API? How?
Yes.
Who will benefit from this feature?
Those who use distributed training.
Any other info.
N/A