[XLA PjRT] Ongoing progress and future plan of PjRT distributed runtime #48210
Labels: comp:xla, stat:awaiting tensorflower, type:feature
Hi TF developers,
This issue is intended to learn more about the ongoing work and future plans for XLA PjRT, especially its distributed runtime for GPUs.
Although there are some related commits (e.g., 44e771a and 8a72c44), the existing PjRT code seems quite preliminary in its support for distributed training across multiple hosts. As far as I can see, it currently only provides basic control operations such as setting up connections and sharing GPU topology, and it is not trivial to run a real distributed demo that involves communication. Another issue, google/jax#2731, seems to align with this point.
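To make concrete what we mean by "basic control operations," here is a rough Python sketch of the control-plane flow those commits appear to add. The binding names and signatures below (`get_distributed_runtime_service`, `get_distributed_runtime_client`, and their arguments) are our assumptions from skimming the code, not a confirmed public API:

```python
# Hypothetical sketch only: the binding names/signatures below are assumptions,
# not a confirmed API. It illustrates the control plane (connection setup plus
# GPU topology exchange) that the referenced commits seem to cover.
from jax.lib import xla_client  # jaxlib's XLA Python bindings

coordinator = "10.0.0.1:1234"   # placeholder coordinator address (host 0)
num_hosts, my_host_id = 2, 0    # placeholder values

# Host 0 only: start the coordination service (assumed binding name).
service = xla_client._xla.get_distributed_runtime_service(coordinator, num_hosts)

# Every host: connect a client; this is where local GPU topology is shared.
client = xla_client._xla.get_distributed_runtime_client(coordinator, my_host_id)
client.connect()

# From here it is unclear to us how to execute an HLO module whose
# collectives (all-reduce, all-gather, send/recv) actually cross hosts.
```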
So I would like to confirm whether the TF team is also working on this and may release some new features soon. For now, our interest is in using PjRT to run HLO with collective or point-to-point communication, including all-reduce, all-gather, reduce-scatter, send/recv, etc.
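For reference, here is a minimal, runnable single-host JAX sketch of the kind of collective we would like to run across hosts; `lax.psum` inside `pmap` lowers to an HLO all-reduce over the local devices, and our question is essentially how to make the same program (or the equivalent raw HLO) span multiple hosts through PjRT:

```python
import jax
import jax.numpy as jnp
from jax import lax

n = jax.local_device_count()

# lax.psum inside pmap compiles to an HLO AllReduce across the mapped axis.
def all_reduce_sum(x):
    return lax.psum(x, axis_name="i")

all_reduce_sum = jax.pmap(all_reduce_sum, axis_name="i")

x = jnp.arange(n, dtype=jnp.float32)   # one value per local device
print(all_reduce_sum(x))               # every device ends up with the same total
```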
Any response will be appreciated. Thanks.