[XLA PjRT] Ongoing progress and future plan of PjRT distributed runtime #48210
Labels: comp:xla, stat:awaiting tensorflower, type:feature
Hi TF developers,
This issue is intended to learn more about the ongoing work and future plans for XLA PjRT, especially its distributed runtime for GPUs.
Although there are some related commits (e.g., 44e771a and 8a72c44), the existing PjRT code seems quite preliminary in its support for distributed training across multiple hosts. As far as I can see, it currently only provides basic control operations such as setting up connections and sharing GPU topology, and it is not trivial to run a real distributed demo that involves communication. Another issue, google/jax#2731, seems to align with this point.
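To make concrete what we mean by "basic control operations," here is a rough Python sketch of the control-plane flow those commits appear to add. The binding names and signatures below (`get_distributed_runtime_service`, `get_distributed_runtime_client`, and their arguments) are our assumptions from skimming the code, not a confirmed public API:

```python
# Hypothetical sketch only: the binding names/signatures below are assumptions,
# not a confirmed API. It illustrates the control plane (connection setup plus
# GPU topology exchange) that the referenced commits seem to cover.
from jax.lib import xla_client  # jaxlib's XLA Python bindings

coordinator = "10.0.0.1:1234"   # placeholder coordinator address (host 0)
num_hosts, my_host_id = 2, 0    # placeholder values

# Host 0 only: start the coordination service (assumed binding name).
service = xla_client._xla.get_distributed_runtime_service(coordinator, num_hosts)

# Every host: connect a client; this is where local GPU topology is shared.
client = xla_client._xla.get_distributed_runtime_client(coordinator, my_host_id)
client.connect()

# From here it is unclear to us how to execute an HLO module whose
# collectives (all-reduce, all-gather, send/recv) actually cross hosts.
```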
So I would like to confirm whether the TF team is also working on this and may release some new features soon. For now, our interest is in using PjRT to run HLO with collective or point-to-point communication, including all-reduce, all-gather, reduce-scatter, send/recv, etc.
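For reference, here is a minimal, runnable single-host JAX sketch of the kind of collective we would like to run across hosts; `lax.psum` inside `pmap` lowers to an HLO all-reduce over the local devices, and our question is essentially how to make the same program (or the equivalent raw HLO) span multiple hosts through PjRT:

```python
import jax
import jax.numpy as jnp
from jax import lax

n = jax.local_device_count()

# lax.psum inside pmap compiles to an HLO AllReduce across the mapped axis.
def all_reduce_sum(x):
    return lax.psum(x, axis_name="i")

all_reduce_sum = jax.pmap(all_reduce_sum, axis_name="i")

x = jnp.arange(n, dtype=jnp.float32)   # one value per local device
print(all_reduce_sum(x))               # every device ends up with the same total
```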
Any response will be appreciated. Thanks.