refactor sampling layer setup #1912

irexyc · 2024-07-03T12:56:14Z

Motivation

setup sampling layer after forward to increase parallelism

test with llama-2-7b-chat and 1000 num_prompts

before

--------------------------------------------------
concurrency: 128
elapsed_time: 79.527s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 3025.156 token/s
token throughput (prompt + completion token): 6147.852 token/s
RPS (request per second): 12.574 req/s
RPM (request per minute): 754.460 req/min
--------------------------------------------------

after

--------------------------------------------------
concurrency: 128
elapsed_time: 78.886s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 3049.727 token/s
token throughput (prompt + completion token): 6197.786 token/s
RPS (request per second): 12.676 req/s
RPM (request per minute): 760.587 req/min
--------------------------------------------------

zhyncs · 2024-07-03T16:23:23Z

src/turbomind/models/llama/LlamaBatch.cc

@@ -1131,6 +1134,7 @@ void LlamaBatch<T>::InitializeSampling(const GenerationState& g)
        }
    }
    outputs_ = std::move(outputs);
+    sync_check_cuda_error();


Why should we cudaDeviceSynchronize here?

It will sync only when TM_DEBUG_LEVEL=DEBUG, I add here to better calculate the time

zhyncs

6 unit tests have failed and may need to be fixed.

[  FAILED  ] 6 tests, listed below:
[  FAILED  ] SamplingDecodeTest/0.TopP, where TypeParam = float
[  FAILED  ] SamplingDecodeTest/0.BatchTopP, where TypeParam = float
[  FAILED  ] SamplingDecodeTest/0.InvalidArgsZeroTopP, where TypeParam = float
[  FAILED  ] SamplingDecodeTest/0.InvalidArgsBatchTopPContainZero, where TypeParam = float
[  FAILED  ] SamplingDecodeTest/0.InvalidArgsTopKBatchTopPContainZero, where TypeParam = float
[  FAILED  ] SamplingDecodeTest/0.LocalBatchBatchTopP, where TypeParam = float

 6 FAILED TESTS

src/turbomind/layers/sampling_layers/TopPSamplingLayer.cu

src/turbomind/layers/sampling_layers/TopPSamplingLayer.h

zhyncs · 2024-07-04T06:53:41Z

src/turbomind/layers/sampling_layers/TopPSamplingLayer.cu

+    uint         min_top_k          = runtime_top_k_size > 0 ? runtime_top_k.min<uint>() : 0;
+    skip_all_                       = false;
+
+    if (runtime_top_p_size == 0 || min_top_k > 0) {


Why skip when min_top_k > 0?

ref https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text

The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.
If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.

Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.
For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P with the final token selected using temperature sampling.
Specify a lower value for less random responses and a higher value for more random responses.

Top-P changes how the model selects tokens for output. Tokens are selected from the most (see top-K) to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model will select either A or B as the next token by using temperature and excludes C as a candidate.
Specify a lower value for less random responses and a higher value for more random responses.

highlight: Then tokens are further filtered based on top-P with the final token selected using temperature sampling.

So I think when min_top_k > 0, we can also use top p sampling.

@lzhangzz @irexyc @lvhan028 Do you have any suggestions?

The topk is fused kernel and wil do topp sampling.

If top_k > 0, chose topk sampling layer, otherwise chose topp sampling layer.

zhyncs

LGTM

src/turbomind/layers/sampling_layers/TopKSamplingLayer.cc

lvhan028 · 2024-07-09T08:09:38Z

src/turbomind/models/llama/LlamaBatch.cc


    if (!skip_init_sampling) {
        g.max_init_ctx_len = max_context_len;
        g.step             = max_context_len;
-        InitializeSampling(g);


Is there any necessity that we move intializeSampling from Initialize to Forward?

lvhan028 · 2024-07-09T08:16:13Z

src/turbomind/layers/sampling_layers/TopKSamplingLayer.cc

+    float        min_top_p          = runtime_top_p_size > 0 ? runtime_top_p.min<float>() : 0.0f;
+    skip_all_                       = false;
+
+    if (max_top_k == 0 && min_top_p != 0.0f) {


what does max_top_k==0 && min_top_p==0.0f indicate?

I get it in setup_topk_runtime_args

What's the relation between top_k and n_logprobs? Is n_logprobs not greater than top_k? I just wonder if skip_allaffects returninglogitsorlogprobs`

lvhan028 · 2024-07-09T08:25:11Z

src/turbomind/layers/sampling_layers/TopKSamplingLayer.cc


-    uint  top_k = runtime_top_k.max<uint>();
-    float top_p = runtime_top_p_size == 0 ? 0.0f : runtime_top_p.getVal<float>();
+    // skip topk setup & forward if all top_k is zero and all top_p is not zero


Should we move this comment before if (max_top_k == 0 && min_top_p != 0.0f)?

lvhan028 · 2024-07-09T09:10:17Z

LGTM.
But I'd like to initiate a discussion: can we only maintain the BaseSamplingLayer? This consideration arises since I have observed a significant amount of duplicated code between TopKSamplingLayer and TopPSamplingLayer

irexyc · 2024-07-09T09:13:22Z

But I'd like to initiate a discussion: can we only maintain the BaseSamplingLayer?

I write some suggestions in #1966

irexyc added 5 commits July 2, 2024 10:13

refactor sampling setup

4ada31c

remove sync after cudaFreeAsync

cd4bdd6

remove unused

e425739

skip sample layer setup if possible

80a2c40

update

6f81e84

zhyncs reviewed Jul 3, 2024

View reviewed changes

fix unittest

0b78ce4

zhyncs reviewed Jul 4, 2024

View reviewed changes

src/turbomind/layers/sampling_layers/TopPSamplingLayer.cu Show resolved Hide resolved

zhyncs reviewed Jul 4, 2024

View reviewed changes

src/turbomind/layers/sampling_layers/TopPSamplingLayer.cu Show resolved Hide resolved

zhyncs reviewed Jul 4, 2024

View reviewed changes

src/turbomind/layers/sampling_layers/TopPSamplingLayer.h Outdated Show resolved Hide resolved

fix naming

821e3eb

zhyncs reviewed Jul 4, 2024

View reviewed changes

irexyc added 2 commits July 4, 2024 07:21

rename file name

f91b1e7

remove unused

f8be204

zhyncs approved these changes Jul 4, 2024

View reviewed changes

lvhan028 requested review from lvhan028 and lzhangzz July 4, 2024 08:36

lvhan028 added the improvement label Jul 4, 2024

zhyncs reviewed Jul 4, 2024

View reviewed changes

src/turbomind/layers/sampling_layers/TopKSamplingLayer.cc Outdated Show resolved Hide resolved

fix

81b7a72

zhyncs mentioned this pull request Jul 4, 2024

[Feature] diff tool for troubleshooting #1908

Closed

zhyncs reviewed Jul 4, 2024

View reviewed changes

src/turbomind/layers/sampling_layers/TopKSamplingLayer.cc Show resolved Hide resolved

remove unused

fb9a777

lvhan028 reviewed Jul 9, 2024

View reviewed changes

lvhan028 approved these changes Jul 9, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor sampling layer setup #1912

refactor sampling layer setup #1912

refactor sampling layer setup #1912

Are you sure you want to change the base?

refactor sampling layer setup #1912

Conversation

Motivation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment