Implement and extend data-specific cartesian index #1902

Open
charleskawczynski wants to merge 1 commit into main from ck/extend_data_specific_cart_ind

Conversation

charleskawczynski (Member) commented on Jul 24, 2024

This PR extends and generalizes #1900, which it supersedes.

This PR effectively fixes the current issue of uncoalesced reads/writes in our pointwise broadcast expressions.

The fix is to specialize on broadcast expressions where a universal cartesian index over all datalayouts is unnecessary, and to instead use a datalayout-specific index (defined as DataSpecificCartesianIndex).

Then, we define getindex/setindex! on datalayouts for this index type and do not swap the indices, since we launch the kernel with CartesianIndices corresponding to size(parent(data)) (as opposed to size(data) in the generalized case).

The result is that all data is accessed in a contiguous / coalesced fashion.
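To make the idea concrete, here is a minimal, self-contained sketch, not ClimaCore's actual implementation: the type in this PR is DataSpecificCartesianIndex, while DataSpecificIndexDemo and VIJFHDemo below are hypothetical names, and the (i, j, f, v, h) permutation is only illustrative.

struct DataSpecificIndexDemo{N}
    I::CartesianIndex{N}
end

struct VIJFHDemo{A <: AbstractArray{<:Any, 5}}
    array::A   # stored as (Nv, Ni, Nj, Nf, Nh)
end
Base.parent(data::VIJFHDemo) = data.array

# Generic path: a "universal" (i, j, f, v, h) index is permuted into the
# parent's (v, i, j, f, h) storage order before reading.
Base.getindex(data::VIJFHDemo, I::CartesianIndex{5}) =
    parent(data)[I[4], I[1], I[2], I[3], I[5]]

# Data-specific path: the index was generated from
# CartesianIndices(size(parent(data))), so it is forwarded with no swap and
# neighboring threads touch neighboring memory.
Base.getindex(data::VIJFHDemo, I::DataSpecificIndexDemo{5}) = parent(data)[I.I]

data = VIJFHDemo(rand(63, 4, 4, 1, 8))
for I in CartesianIndices(size(parent(data)))
    @assert data[DataSpecificIndexDemo(I)] === parent(data)[I]
end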

charleskawczynski force-pushed the ck/extend_data_specific_cart_ind branch 2 times, most recently from b41eb05 to cd17c13, on July 25, 2024 at 13:19
charleskawczynski marked this pull request as ready for review on July 25, 2024 at 13:23
charleskawczynski changed the title from "Extend data-specific cartesian ind" to "Implement and extend data-specific cartesian index" on Jul 25, 2024
charleskawczynski (Member, Author) commented on Jul 25, 2024

On the Clima node, here's how the copyto! benchmark for pointwise broadcast expressions changes:

Main

Benchmarking ClimaCore copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  116.978 μs …   3.014 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     120.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   121.499 μs ± 29.288 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▁▁▂▃▄▅▅▇█▅▇▇▇▆▅▆▆▇▄▄▄▂▁▁                     
  ▁▁▁▁▁▁▁▂▂▂▂▃▄▅▆█▇██████████████████████████▆▆▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂ ▅
  117 μs          Histogram: frequency by time          125 μs <

 Memory estimate: 720 bytes, allocs estimate: 28.
Benchmarking array copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  112.709 μs …   3.142 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     114.778 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   115.419 μs ± 30.815 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▄▄▄▇█▆▄▄▂▁                                      
  ▁▁▁▁▂▂▂▂▃▃▄▆▆███████████▇▆▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  113 μs          Histogram: frequency by time          119 μs <

 Memory estimate: 1.50 KiB, allocs estimate: 65.
Test Summary:       | Pass  Total   Time
copyto! with Nf = 1 |    1      1  29.2s
Test.DefaultTestSet("copyto! with Nf = 1", Any[], 1, false, false, true, 1.721913773875106e9, 1.721913803097613e9, false, "REPL[9]")

This PR

Benchmarking ClimaCore copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.350 μs …   2.607 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     67.989 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.561 μs ± 27.209 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▄▆▆▅▅▃▄▄▆▄▅▆██▆▆▅▄▄▅▄▂▁▁▁▁                    
  ▁▁▂▃▃▄▄▅▅▄▄▄▄▆███████████████████████████▇▆▆▅▅▅▄▄▃▃▃▃▂▂▂▁▁▁ ▅
  64.4 μs         Histogram: frequency by time        72.2 μs <

 Memory estimate: 720 bytes, allocs estimate: 28.
Benchmarking array copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  111.759 μs …   4.475 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.349 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   113.971 μs ± 43.903 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▄▅▇▇██▇▇▅▄▂▁                                      
  ▁▁▁▁▁▂▂▄▅▆██████████████▇▆▅▄▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  112 μs          Histogram: frequency by time          117 μs <

 Memory estimate: 1.50 KiB, allocs estimate: 65.
Test Summary:       | Pass  Total   Time
copyto! with Nf = 1 |    1      1  28.5s
Test.DefaultTestSet("copyto! with Nf = 1", Any[], 1, false, false, true, 1.721913933078668e9, 1.721913961585961e9, false, "REPL[9]")

From a hardware performance POV, we'll now have:

julia> function bandwidth_efficiency(;
           problem_size=(63,4,4,1,5400),
           float_type,
           device_bandwidth_GBs=2_039, # clima A100 specs: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
           kernel_time_s,
           n_reads_writes
         )
         N = prod(problem_size)
         GB = N*n_reads_writes*sizeof(float_type)/1024^3
         achieved_bandwidth_GBs = GB/kernel_time_s
         percent_efficiency = achieved_bandwidth_GBs/device_bandwidth_GBs*100
         @info "Bandwidth info" problem_size float_type achieved_bandwidth_GBs device_bandwidth_GBs percent_efficiency
       end
bandwidth_efficiency (generic function with 1 method)

julia> bandwidth_efficiency(; # main (with index swapping)
           float_type=Float64, kernel_time_s=(116.978)/10^6, n_reads_writes=2
       )
┌ Info: Bandwidth info
│   problem_size = (63, 4, 4, 1, 5400)
│   float_type = Float64
│   achieved_bandwidth_GBs = 693.3782472802712
│   device_bandwidth_GBs = 2039
└   percent_efficiency = 34.00579927809079

julia> bandwidth_efficiency(; # this PR (no index swapping)
           float_type=Float64, kernel_time_s=(64.350)/10^6, n_reads_writes=2
       )
┌ Info: Bandwidth info
│   problem_size = (63, 4, 4, 1, 5400)
│   float_type = Float64
│   achieved_bandwidth_GBs = 1260.4506699355334
│   device_bandwidth_GBs = 2039
└   percent_efficiency = 61.817100045881965

From https://developer.download.nvidia.com/GTC/PDF/1083_Wang.pdf, slide 35:

70-80% is very good

So I'd say that this PR puts us in the "good" range for pointwise broadcast expressions 🎉. cc @tapios

Next, we should fix the stencil kernels. That will be trickier since the indexing is split up, and I think some of those kernels may use non-cartesian indexing. But we should see similar speedups there.

For future self: the reason we're faster than ordinary CuArrays is that our datalayouts now have completely static sizes, so offsets can be determined at compile time. For CuArrays, the sizes are not known at compile time.
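A toy illustration of that point (hypothetical names, not ClimaCore code): when the extents are type parameters, the stride arithmetic in getindex can constant-fold at compile time, whereas a runtime-sized array carries its extents as fields and must load them at run time.

# Nv and Ni are type parameters, so they are compile-time constants.
struct StaticSized2DDemo{Nv, Ni, A <: AbstractVector}
    v::A   # flat, column-major storage of an (Nv, Ni) block
end

# `(i - 1) * Nv` compiles to a multiply by a literal constant.
@inline Base.getindex(x::StaticSized2DDemo{Nv, Ni}, v::Int, i::Int) where {Nv, Ni} =
    x.v[(i - 1) * Nv + v]

x = StaticSized2DDemo{3, 4, Vector{Float64}}(rand(12))
@assert x[2, 3] == x.v[(3 - 1) * 3 + 2]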

Review threads (resolved): test/runtests.jl, src/DataLayouts/DataLayouts.jl, src/DataLayouts/cartesian_index.jl, src/DataLayouts/has_uniform_datalayouts.jl
Sbozzolo (Member) left a comment:

LGTM, but I'd add a comment about the need for redefining the broadcast machinery before merging (as you did in the comment).

charleskawczynski force-pushed the ck/extend_data_specific_cart_ind branch 2 times, most recently from 5a4e9c0 to 137ac8b, on August 13, 2024 at 20:27
Commit: Extend data-specific cartesian index
charleskawczynski (Member, Author) commented on Aug 15, 2024

Just to circle back on this PR: it turns out that this solution is pretty effective for Float64, but not so much for Float32 (xref).
