
[XLA PjRT] Ongoing progress and future plan of PjRT distributed runtime #48210

Open · ymjiang opened this issue Mar 31, 2021 · 1 comment
Labels: comp:xla (XLA), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:feature (Feature requests)

ymjiang commented Mar 31, 2021

Hi TF developers,

This issue is intended to learn more about the ongoing work and future plans for XLA PjRT, especially the distributed runtime for GPUs.

Although there are some related commits (e.g., 44e771a and 8a72c44), the existing PjRT code seems quite preliminary in its support for distributed training across multiple hosts. As far as I can see, it currently only provides basic control operations such as setting up connections and sharing the GPU topology, and it is not trivial to run a real distributed demo with communication. Another issue, google/jax#2731, seems to align with this point as well.
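
For illustration only (not something PjRT itself documents as a ready-made workflow today), the kind of control-plane setup I am referring to is roughly what JAX surfaces as jax.distributed.initialize: one process hosts the coordination service, the others connect to it, and the GPU topology is shared so every process can see all devices. The coordinator address, process count, and process id below are placeholder values.

```python
# Illustration only: multi-host connection setup and topology sharing,
# as surfaced through JAX on top of the PjRT distributed runtime.
# The coordinator address, process count, and process id are placeholders.
import jax

jax.distributed.initialize(
    coordinator_address="10.0.0.1:8476",  # process 0 hosts the coordination service
    num_processes=2,                      # one process per host
    process_id=0,                         # this process's rank
)

print(jax.devices())        # all GPUs across every host
print(jax.local_devices())  # only the GPUs attached to this host
```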

So I would like to confirm whether the TF team is also working on this and whether new features may be released soon. For now, our interest is in using PjRT to run HLO with collective or point-to-point communication, including all-reduce, all-gather, reduce-scatter, send/recv, etc.
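
To make the request concrete, here is a small, purely illustrative example of the kind of collective we have in mind, written with jax.pmap and lax.psum, which lowers to an HLO all-reduce across participating devices:

```python
# Illustrative only: a per-device all-reduce, which lowers to an HLO all-reduce op.
import jax
import jax.numpy as jnp

n = jax.local_device_count()
x = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)  # one shard per local device

all_reduce = jax.pmap(lambda v: jax.lax.psum(v, axis_name="i"), axis_name="i")
print(all_reduce(x))  # every device ends up with the sum of all shards
```

Running this kind of program across multiple hosts (rather than a single host's devices) is what we hope the PjRT distributed runtime will eventually make straightforward.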

Any response will be appreciated. Thanks.

ymjiang added the type:others label (issues not falling in bug, performance, support, build and install, or feature) on Mar 31, 2021
jvishnuvardhan added the stat:awaiting tensorflower (Status - Awaiting response from tensorflower) and type:feature (Feature requests) labels and removed the type:others label on Mar 31, 2021
bhack (Contributor) commented Dec 10, 2022

/cc @joker-eph
