Commit
[XLA:GPU] Unroll small column reductions less, and use less shmem.
A column reduction has an unroll factor called `num_partial_results`. I think this is a misnomer -- I observe that with num_partial_results == 4, my kernel produces four *complete* (i.e. not partial) results per warp.

This patch makes two changes.

1. We now unroll small column reductions less. Previously, we could get into a situation where, due to unrolling, we don't produce enough blocks to saturate the GPU.

2. We use a new codegen strategy for column reductions that uses less shared memory when the column reduction is unrolled. Previously we used a chunk of Nx33x32 elements where N is the unroll factor. But actually only one 33x32 block is live at a time, so this is N times larger than necessary.

If we don't do (1) before doing (2), XLA takes advantage of the additional available shmem and unrolls small reductions even more, causing performance regressions!

PiperOrigin-RevId: 538403704
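The arithmetic behind both changes can be sketched in a few lines of C++. This is only an illustration, not code from the patch: every function name below is hypothetical, the assumption that one block covers 32 * unroll output columns is mine, and `PickUnroll` is a made-up heuristic standing in for whatever rule XLA actually applies. Only the Nx33x32 versus 33x32 element counts come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Tile shape taken from the commit message ("33x32 elements").
const std::int64_t kTileRows = 33;
const std::int64_t kTileCols = 32;

// Old layout: one 33x32 tile per unrolled result, i.e. N x 33 x 32 elements.
std::int64_t OldShmemBytes(std::int64_t unroll, std::int64_t elem_bytes) {
  assert(unroll >= 1);
  return unroll * kTileRows * kTileCols * elem_bytes;
}

// New layout: a single 33x32 tile, since only one tile is live at a time.
std::int64_t NewShmemBytes(std::int64_t elem_bytes) {
  return kTileRows * kTileCols * elem_bytes;
}

// ASSUMPTION: each block produces 32 * unroll output columns, so a larger
// unroll factor means fewer blocks in the grid.
std::int64_t NumBlocks(std::int64_t num_columns, std::int64_t unroll) {
  assert(unroll >= 1);
  const std::int64_t cols_per_block = kTileCols * unroll;
  return (num_columns + cols_per_block - 1) / cols_per_block;  // ceiling division
}

// HYPOTHETICAL heuristic (not XLA's): halve the unroll factor until the grid
// has at least min_blocks blocks or the unroll factor reaches 1.
std::int64_t PickUnroll(std::int64_t num_columns, std::int64_t max_unroll,
                        std::int64_t min_blocks) {
  std::int64_t unroll = max_unroll;
  while (unroll > 1 && NumBlocks(num_columns, unroll) < min_blocks) {
    unroll /= 2;
  }
  return unroll;
}
```

Under these assumptions, with 4-byte elements and an unroll factor of 4, the old layout needs 16896 bytes of shared memory and the new one 4224 bytes. A reduction over 1024 columns yields 8 blocks at unroll 4 but 32 blocks at unroll 1, which is the kind of under-saturation change (1) is meant to avoid.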
1 parent d27dc25, commit 42ea7ad
Showing 2 changed files with 116 additions and 52 deletions.