Performance Regression from commit 7dcd870 #22683
Comments
cc @gante and @ArthurZucker
@fpgaminer commit 7dcd870 fixes generation when there is padding in the input (which is almost always the case for batched generation). We don't optimize for performance but rather for correctness, and any way of skipping this gathering would still have to remain correct in the padded case. Meanwhile, is there anything else we can help you with?
That's fair, though a 10% performance hit is rather painful. To that end, here's my attempt to optimize `apply_rotary_pos_emb`.
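The commenter's snippet and its timing output were not preserved in this copy of the thread. As a rough sketch of the kind of change under discussion (replacing the repeat-plus-`torch.gather` selection of the cos/sin rows with plain advanced indexing on `position_ids`), something along these lines could work; the function names, shapes, and the equivalence check below are illustrative assumptions rather than the original code.

```python
import torch


def rotate_half(x):
    """Standard RoPE helper: rotate half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb_gather(q, k, cos, sin, position_ids):
    """Roughly the post-7dcd870 shape: select cos/sin rows via repeat + torch.gather."""
    gather_indices = position_ids[:, None, :, None]                       # [bs, 1, q_len, 1]
    gather_indices = gather_indices.repeat(1, cos.shape[1], 1, cos.shape[3])
    cos = torch.gather(cos.repeat(gather_indices.shape[0], 1, 1, 1), 2, gather_indices)
    sin = torch.gather(sin.repeat(gather_indices.shape[0], 1, 1, 1), 2, gather_indices)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def apply_rotary_pos_emb_indexed(q, k, cos, sin, position_ids):
    """Same row selection done with plain advanced indexing, no repeat or gather."""
    cos = cos.squeeze(1).squeeze(0)           # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)           # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)      # [bs, 1, q_len, dim]
    sin = sin[position_ids].unsqueeze(1)      # [bs, 1, q_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


# Quick equivalence check on a small, left-padded-looking batch.
bs, heads, q_len, dim, max_len = 2, 4, 5, 8, 16
q = torch.randn(bs, heads, q_len, dim)
k = torch.randn(bs, heads, q_len, dim)
cos = torch.randn(1, 1, max_len, dim)   # stand-ins for the cached cos/sin tables
sin = torch.randn(1, 1, max_len, dim)
position_ids = torch.tensor([[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]])

for a, b in zip(apply_rotary_pos_emb_gather(q, k, cos, sin, position_ids),
                apply_rotary_pos_emb_indexed(q, k, cos, sin, position_ids)):
    assert torch.allclose(a, b)
print("gather and indexing variants agree")
```

The indexing variant skips the two `.repeat` allocations and the gather kernel, which is presumably where most of the per-step overhead comes from at decode time (batch size 1, one new token per forward pass).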
For reference, the pre-7dcd870 function runs in 0.030 ms on a 1x1 input, so this isn't quite as fast but gets closer. Would a pull request with this change be welcome? I've done my best to verify its correctness with the above code.
@fpgaminer that is great! Absolutely, a PR would be very welcome 🙌 (We'd be happy to integrate other optimization opportunities if you spot them; we rarely have the bandwidth to optimize our modeling code.)
Maybe there's something I'm not seeing here, but Llama uses rotary positional embeddings, so left padding should have no effect on the result? Sure, the intermediate result from `apply_rotary_pos_emb` will differ with different position indices, but since the embedding encodes relative offsets, the final attention scores should come out the same. Or are you saying there are cases when padding is literally inserted inside of the sequence, therefore changing the relative distances between tokens, @gante?
@aljungberg I agree with everything you wrote; rotary positional embeddings should be position-invariant. In practice, the small rounding errors compound over autoregressive text generation, leading greedy decoding (which is normally invariant w.r.t. small fluctuations) to produce different text. With the right position index, the error becomes much smaller, and the results become more stable regardless of padding. That's why we also added it to our high-performance text generation repo, despite the difference being quite small. Out of curiosity, this test was failing on GPTNeoX and Llama before we added this change; in theory, it shouldn't have failed at all!
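To make the padding point concrete, here is a small sketch of how position indices typically get built for a left-padded batch. It assumes the cumsum-over-attention-mask convention used by transformers' generation utilities; the tensors are made up for illustration and are not from this thread.

```python
import torch

# Two sequences of "real" length 3 and 5, left-padded to a common length of 5.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])

# Cumulative sum of the mask, shifted to start at 0; pad slots get a dummy value
# (they are masked out of attention anyway).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```

Without indexing cos/sin by these per-sequence position ids, the left-padded row would be rotated with positions [0, 1, 2, 3, 4], so its real tokens would start at angle 2 instead of 0. In exact arithmetic RoPE only affects attention through relative offsets, so the outputs would still match, but in float16/float32 the different absolute angles introduce the small rounding differences described above.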
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.28.0.dev0 (656e869)
Who can help?
@ArthurZucker @younesbelkada
Reproduction
I have a script that benchmarks the generation speed of different LLaMA models. Before commit 7dcd870 my generation speed averaged around 48 tokens/s in ideal cases on an RTX 3090. After that commit the average speed is 43 tokens/s.
The specific issue seems to be the change to `apply_rotary_pos_emb`. My guess is that the regression comes from replacing a rather simple slicing of two tensors with a scatter-gather. To test this theory I patched `apply_rotary_pos_emb` back to its pre-7dcd870 state and minimally modified `LlamaAttention` accordingly, with no other modifications. Speed jumped back to 48 tokens/s.
The problem should apply generally, but the specific script I'm using is: https://github.com/fpgaminer/GPTQ-triton/blob/99ec4a3adb7fad9de33ff026bbfb64cbb3bab2f8/benchmark_generate.py
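For illustration, here is a self-contained sketch of what "simple slicing" versus "scatter-gather" looks like in isolation, with a crude per-call timing loop on a decode-shaped input (batch 1, one new token). The two functions are paraphrases of the pre- and post-7dcd870 shapes rather than exact copies, the harness is not the linked benchmark script, and any numbers it prints are machine-dependent and not from this issue.

```python
import time

import torch


def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb_sliced(q, k, cos, sin, offset=0):
    """Roughly the pre-7dcd870 shape: slice the cached cos/sin at the current offset."""
    cos = cos[..., offset : q.shape[-2] + offset, :]
    sin = sin[..., offset : q.shape[-2] + offset, :]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def apply_rotary_pos_emb_gather(q, k, cos, sin, position_ids):
    """Roughly the post-7dcd870 shape: repeat + torch.gather keyed on position_ids."""
    gather_indices = position_ids[:, None, :, None].repeat(1, cos.shape[1], 1, cos.shape[3])
    cos = torch.gather(cos.repeat(gather_indices.shape[0], 1, 1, 1), 2, gather_indices)
    sin = torch.gather(sin.repeat(gather_indices.shape[0], 1, 1, 1), 2, gather_indices)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def bench(fn, *args, iters=1000):
    """Crude wall-clock timing of one call per decoding step."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call


device = "cuda" if torch.cuda.is_available() else "cpu"
dim, max_len, offset = 128, 2048, 512
q = torch.randn(1, 32, 1, dim, device=device)          # decode step: batch 1, one new token
k = torch.randn(1, 32, 1, dim, device=device)
cos = torch.randn(1, 1, max_len, dim, device=device)   # stand-ins for the cached tables
sin = torch.randn(1, 1, max_len, dim, device=device)
position_ids = torch.tensor([[offset]], device=device)

print("sliced:", bench(apply_rotary_pos_emb_sliced, q, k, cos, sin, offset), "ms")
print("gather:", bench(apply_rotary_pos_emb_gather, q, k, cos, sin, position_ids), "ms")
```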
Expected behavior
I would not expect a 10% drop in performance.