
Need MPMD support on GPU for pipeline-parallel training of large-scale models #62736

Open
MoFHeka opened this issue Jan 4, 2024 · 16 comments
Assignees
Labels
comp:dist-strat Distribution Strategy related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.15 For issues related to 2.15.x type:feature Feature requests

Comments

@MoFHeka
MoFHeka commented Jan 4, 2024

Issue type

Feature Request

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

tf 2.15

Custom code

No

OS platform and distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Pipeline parallelism has been available in PyTorch for a long time. It is very useful for training a CTR model with an embedding pipeline, or for training a large language model across two machines.

Standalone code to reproduce the issue

Maybe a simple send/recv construct, like the one TPUs use with XLA?
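For illustration, a minimal sketch of the kind of two-stage GPU pipeline this request is about, written with plain device placement in today's TF so the cross-device copy plays the role of send/recv (the device names and layer sizes are assumptions):

```python
import tensorflow as tf

# Hedged sketch of a two-stage pipeline split across two GPUs.
# The cross-device copy of `h` stands in for an explicit send/recv edge.
# "/GPU:0", "/GPU:1" and the layer sizes are assumptions for illustration.
stage0 = tf.keras.Sequential([tf.keras.layers.Dense(1024, activation="relu")])
stage1 = tf.keras.Sequential([tf.keras.layers.Dense(10)])

@tf.function
def pipeline_step(x):
    with tf.device("/GPU:0"):
        h = stage0(x)      # stage 0 forward pass on the first device
    with tf.device("/GPU:1"):
        y = stage1(h)      # `h` is transferred GPU:0 -> GPU:1 here
    return y

y = pipeline_step(tf.random.normal([32, 512]))
```

A real MPMD pipeline would additionally need micro-batching and a schedule (e.g. GPipe or 1F1B), which is exactly what is missing today.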

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:feature Feature requests label Jan 4, 2024
@SuryanarayanaY SuryanarayanaY added TF 2.15 For issues related to 2.15.x comp:dist-strat Distribution Strategy related issues labels Jan 5, 2024
@SuryanarayanaY
Collaborator

Hi @MoFHeka ,

TensorFlow currently supports data parallelism only; model parallelism is not yet fully supported for training. But with the help of DTensor we can achieve both data and model parallelism, as per the attached tutorial.

As per my understanding, pipeline parallelism is a hybrid approach combining data and model parallelism.
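For reference, a minimal sketch of the DTensor approach from that tutorial, sharding data over a 1-D "batch" mesh (the two logical CPU devices and the tensor shape are just illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.experimental import dtensor

# Split the single physical CPU into 2 logical devices so the sketch runs on one host.
cpu = tf.config.list_physical_devices("CPU")[0]
tf.config.set_logical_device_configuration(
    cpu, [tf.config.LogicalDeviceConfiguration()] * 2)

# 1-D mesh with a "batch" dimension of size 2 -> data parallelism.
mesh = dtensor.create_mesh([("batch", 2)], devices=["CPU:0", "CPU:1"])

# Shard dimension 0 of the tensor over "batch", replicate dimension 1.
layout = dtensor.Layout(["batch", dtensor.UNSHARDED], mesh)
x = dtensor.call_with_layout(tf.zeros, layout, shape=(8, 16))
print(dtensor.fetch_layout(x))
```

Sharding a weight dimension instead of the batch dimension gives model (tensor) parallelism, but there is no pipeline schedule here.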

@MoFHeka
Author
MoFHeka commented Jan 5, 2024

@SuryanarayanaY But apparently DTensor cannot provide pipeline parallelism. Pipeline parallelism is based on send and receive collective operators.
As tensorflow/compiler/xla/hlo/experimental/auto_sharding/auto_sharding.cc shows, it seems that ZeRO stage 3 and pipeline parallelism are already supported by XLA.

@dathudeptrai

@MoFHeka I always felt that XLA and TensorFlow/JAX actually support many hidden features but never mention them or write documentation for them :)).

@MoFHeka
Author
MoFHeka commented Jan 6, 2024

@dathudeptrai I really agree with that. Large projects often lead to difficulties in project management.

@SuryanarayanaY
Collaborator

As tensorflow/compiler/xla/hlo/experimental/auto_sharding/auto_sharding.cc shows, it seems that ZeRO stage 3 and pipeline parallelism are already supported by XLA.

Hi @MoFHeka ,

I doubt it; correct me if I am wrong. From the TF 2.14 code in auto_sharding.cc I can see that XLA supports SPMD, which covers the data parallelism that TF already supports. Can you point out exactly which part you are referring to that makes you think XLA supports model parallelism or MPMD? This may help us escalate the issue to an SME and get confirmation. Thanks!

@SuryanarayanaY SuryanarayanaY added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 8, 2024
@MoFHeka
Author
MoFHeka commented Jan 8, 2024

As tensorflow/compiler/xla/hlo/experimental/auto_sharding/auto_sharding.cc shows, it seems that ZeRO stage 3 and pipeline parallelism are already supported by XLA.

Hi @MoFHeka ,

I doubt it; correct me if I am wrong. From the TF 2.14 code in auto_sharding.cc I can see that XLA supports SPMD, which covers the data parallelism that TF already supports. Can you point out exactly which part you are referring to that makes you think XLA supports model parallelism or MPMD? This may help us escalate the issue to an SME and get confirmation. Thanks!

@SuryanarayanaY Please check these comments. They say "This can result in a strategy similar to ZeRO stage 3. NOTE: The combination of this branch with pipeline parallel is not tested."

// NOTE: The combination of this branch with pipeline parallel is not

And please check here: TPUs have supported MPMD for a long time. It says "If any of the inputs/outputs have maximal sharding, then fallback to MPMD."

// maximal sharding, then fallback to MPMD. Also fall back if any of the

@dathudeptrai
dathudeptrai commented Jan 15, 2024

@MoFHeka I suggest you use JAX instead. Something like this, or flash-attention, FP8 training, int8 training, ... are all available in JAX with native support from XLA. TF also uses XLA but it is kind of hard to customize.

@MoFHeka
Author
MoFHeka commented Jan 15, 2024

@MoFHeka I suggest you use JAX instead. Something like this, or flash-attention, FP8 training, int8 training, ... are all available in JAX with native support from XLA. TF also uses XLA but it is kind of hard to customize.

@dathudeptrai Unfortunately, JAX also doesn't support many features, such as sequence parallelism. And at the XLA level, even with JAX, many features are actually only supported on TPU.
One more important thing: I can't train CTR models with JAX, which lacks too many things.

@dathudeptrai

@MoFHeka NVIDIA/TransformerEngine#602

@MoFHeka
Author
MoFHeka commented Jan 15, 2024

@MoFHeka NVIDIA/TransformerEngine#602

@dathudeptrai This really surprised me. I always thought it was difficult to shard along the sequence dimension in JAX's sharding process.
But there is one thing I am not sure about: if I am going to train an LLM with pipeline parallelism in JAX, should I use a Ray-based engine like Alpa or a JAX-native one? JAX doesn't seem to have a good software library that supports all of these accelerations right now.

@dathudeptrai
dathudeptrai commented Jan 15, 2024

@MoFHeka Yeah. Generally speaking, coding in JAX is harder than PyTorch and a bit easier than TF. For low-level customization, I think JAX is better. Performance-wise, my experiments showed that JAX is better than PyTorch :). Even DeepSpeed + Flash-Attention-2 + PyTorch is still not as good as JAX :)).

You can refer to some open-source projects to see how you can customize parallel training in JAX, e.g. https://github.com/alpa-projects/alpa, which I call DeepSpeed for JAX :).
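To show the kind of manual control JAX gives (independent of Alpa itself), here is a hedged sketch of a hand-rolled two-stage pipeline using explicit device placement; the shapes and the assumption of at least two local devices are illustrative only:

```python
import jax
import jax.numpy as jnp

# Hedged sketch: put each stage's weights and compute on a different device and
# move the activation between stages by hand. With a single local device both
# stages simply land on the same device; shapes are made up for illustration.
devs = jax.local_devices()
w0 = jax.device_put(jnp.ones((16, 32)), devs[0])
w1 = jax.device_put(jnp.ones((32, 4)), devs[-1])

def stage0(x):
    return jnp.tanh(x @ w0)

def stage1(h):
    return h @ w1

x = jax.device_put(jnp.ones((8, 16)), devs[0])
h = stage0(x)                    # runs on devs[0]
h = jax.device_put(h, devs[-1])  # hand-rolled "send/recv" between stages
y = stage1(h)                    # runs on devs[-1]
```

Alpa automates this kind of inter-operator partitioning plus the micro-batch schedule.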

@MoFHeka
Author
MoFHeka commented Jan 16, 2024

@dathudeptrai Unfortunately, due to the lack of sequence parallelism, the compute utilization of Alpa is lower than that of Megatron with the same tensor parallelism optimization, because Alpa uses too much device memory with TP, which forces a smaller batch size.

Besides, CTR models really can't be trained with JAX. The JAX ecosystem of online serving, data processing, and other components (such as Keras) is way too far behind TF.

@dathudeptrai
dathudeptrai commented Jan 16, 2024

@MoFHeka Why not use both TF and JAX at the same time :)? You can call JAX code from TF code nowadays. Also please check out some advanced attention techniques recently introduced in JAX (https://github.com/lhao499/large-sequence-modeling/tree/main).

I personally think the biggest problem with both TF and JAX is documentation :)).
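One concrete way to call JAX from TF today is jax2tf; a minimal sketch (the function and shapes are made up for illustration):

```python
import jax.numpy as jnp
import tensorflow as tf
from jax.experimental import jax2tf

# A JAX function wrapped so it can be called from TF code (e.g. inside a Keras layer).
def jax_gelu(x):
    return 0.5 * x * (1.0 + jnp.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

tf_gelu = jax2tf.convert(jax_gelu)       # now callable with TF tensors
y = tf_gelu(tf.random.normal([4, 8]))    # usable inside tf.function / Keras layers
```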

@MoFHeka
Author
MoFHeka commented Jan 16, 2024

@MoFHeka Why not use both TF and JAX at the same time :)? You can call JAX code from TF code nowadays. Also please check out some advanced attention techniques recently introduced in JAX (https://github.com/lhao499/large-sequence-modeling/tree/main).

@dathudeptrai Yes, that's right, JAX kernels can be used in TF code, although there's no big difference between JAX kernels and Keras layers compiled with XLA.
But the problem is that the pipeline parallelism capabilities of JAX cannot be used from TF. TF currently lacks pipeline parallelism components.

Even in recent updates, DTensor is only used to support tensor parallelism. But for training CTR models, one of the biggest usage scenarios of TF, what is needed most is pipeline parallelism.

@MoFHeka
Author
MoFHeka commented Jan 19, 2024

@SuryanarayanaY Hi~ Is there any way to implement pipeline training in TensorFlow?
tf.distribute.experimental.rpc.Server with server.register looks like a good choice, but I'm not sure.
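A hedged sketch of that idea, registering one pipeline stage as an RPC method that another worker could call; the address, stage function, and shapes are assumptions, and this only shows the plumbing, not a full pipeline schedule:

```python
import tensorflow as tf

# Worker hosting pipeline stage 1 behind an RPC endpoint (address is an assumption).
server = tf.distribute.experimental.rpc.Server.create("grpc", "localhost:50051")

@tf.function(input_signature=[tf.TensorSpec([None, 128], tf.float32)])
def stage1_forward(h):
    # Placeholder for the second pipeline stage's computation.
    return tf.nn.relu(h)

server.register("stage1_forward", stage1_forward)
server.start()

# On the stage-0 worker: send the local stage's output to the remote stage.
client = tf.distribute.experimental.rpc.Client.create("grpc", "localhost:50051")
result = client.call(
    "stage1_forward",
    args=[tf.random.normal([32, 128])],
    output_specs=tf.TensorSpec([None, 128], tf.float32),
)
if result.is_ok():
    activations = result.get_value()
```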

@MoFHeka
Author
MoFHeka commented Jun 19, 2024

@SuryanarayanaY Any progress? TensorFlow seems to be falling way behind in its competition with PyTorch...
