Implement and extend data-specific cartesian index #1902

Open
charleskawczynski wants to merge 1 commit into main from ck/extend_data_specific_cart_ind

Conversation

charleskawczynski (Member) commented on Jul 24, 2024

This PR extends and generalizes #1900, which it supersedes.

This PR effectively fixes the current issue of uncoalesced reads/writes in our pointwise broadcast expressions.

The fix is to specialize on broadcast expressions where a universal cartesian index over all datalayouts is unnecessary, and to instead use a datalayout-specific index (defined as DataSpecificCartesianIndex).

Then, we define getindex/setindex! on datalayouts for this index type and do not swap the indices, since we launch the kernel with CartesianIndices corresponding to size(parent(data)) (as opposed to size(data) in the generalized case).

The result is that all data is accessed in a contiguous / coalesced fashion.
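To make the idea concrete, here is a minimal, self-contained sketch, not ClimaCore's actual implementation: the type in this PR is DataSpecificCartesianIndex, while DataSpecificIndexDemo and VIJFHDemo below are hypothetical names, and the (i, j, f, v, h) permutation is only illustrative.

struct DataSpecificIndexDemo{N}
    I::CartesianIndex{N}
end

struct VIJFHDemo{A <: AbstractArray{<:Any, 5}}
    array::A   # stored as (Nv, Ni, Nj, Nf, Nh)
end
Base.parent(data::VIJFHDemo) = data.array

# Generic path: a "universal" (i, j, f, v, h) index is permuted into the
# parent's (v, i, j, f, h) storage order before reading.
Base.getindex(data::VIJFHDemo, I::CartesianIndex{5}) =
    parent(data)[I[4], I[1], I[2], I[3], I[5]]

# Data-specific path: the index was generated from
# CartesianIndices(size(parent(data))), so it is forwarded with no swap and
# neighboring threads touch neighboring memory.
Base.getindex(data::VIJFHDemo, I::DataSpecificIndexDemo{5}) = parent(data)[I.I]

data = VIJFHDemo(rand(63, 4, 4, 1, 8))
for I in CartesianIndices(size(parent(data)))
    @assert data[DataSpecificIndexDemo(I)] === parent(data)[I]
end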

charleskawczynski force-pushed the ck/extend_data_specific_cart_ind branch 2 times, most recently from b41eb05 to cd17c13, on July 25, 2024 at 13:19
charleskawczynski marked this pull request as ready for review on July 25, 2024 at 13:23
charleskawczynski changed the title from "Extend data-specific cartesian ind" to "Implement and extend data-specific cartesian index" on Jul 25, 2024
charleskawczynski (Member, Author) commented on Jul 25, 2024

On the Clima node, here's how the copyto! benchmark for pointwise broadcast expressions changes:

Main

Benchmarking ClimaCore copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  116.978 μs …   3.014 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     120.999 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   121.499 μs ± 29.288 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▁▁▂▃▄▅▅▇█▅▇▇▇▆▅▆▆▇▄▄▄▂▁▁                     
  ▁▁▁▁▁▁▁▂▂▂▂▃▄▅▆█▇██████████████████████████▆▆▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂ ▅
  117 μs          Histogram: frequency by time          125 μs <

 Memory estimate: 720 bytes, allocs estimate: 28.
Benchmarking array copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  112.709 μs …   3.142 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     114.778 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   115.419 μs ± 30.815 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▄▄▄▇█▆▄▄▂▁                                      
  ▁▁▁▁▂▂▂▂▃▃▄▆▆███████████▇▆▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  113 μs          Histogram: frequency by time          119 μs <

 Memory estimate: 1.50 KiB, allocs estimate: 65.
Test Summary:       | Pass  Total   Time
copyto! with Nf = 1 |    1      1  29.2s
Test.DefaultTestSet("copyto! with Nf = 1", Any[], 1, false, false, true, 1.721913773875106e9, 1.721913803097613e9, false, "REPL[9]")

This PR

Benchmarking ClimaCore copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.350 μs …   2.607 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     67.989 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.561 μs ± 27.209 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▄▆▆▅▅▃▄▄▆▄▅▆██▆▆▅▄▄▅▄▂▁▁▁▁                    
  ▁▁▂▃▃▄▄▅▅▄▄▄▄▆███████████████████████████▇▆▆▅▅▅▄▄▃▃▃▃▂▂▂▁▁▁ ▅
  64.4 μs         Histogram: frequency by time        72.2 μs <

 Memory estimate: 720 bytes, allocs estimate: 28.
Benchmarking array copyto! for VIJFH DataLayout
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  111.759 μs …   4.475 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.349 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   113.971 μs ± 43.903 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▄▅▇▇██▇▇▅▄▂▁                                      
  ▁▁▁▁▁▂▂▄▅▆██████████████▇▆▅▄▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  112 μs          Histogram: frequency by time          117 μs <

 Memory estimate: 1.50 KiB, allocs estimate: 65.
Test Summary:       | Pass  Total   Time
copyto! with Nf = 1 |    1      1  28.5s
Test.DefaultTestSet("copyto! with Nf = 1", Any[], 1, false, false, true, 1.721913933078668e9, 1.721913961585961e9, false, "REPL[9]")

From a hardware performance POV, we'll now have:

julia> function bandwidth_efficiency(;
           problem_size=(63,4,4,1,5400),
           float_type,
           device_bandwidth_GBs=2_039, # clima A100 specs: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
           kernel_time_s,
           n_reads_writes
         )
         N = prod(problem_size)
         GB = N*n_reads_writes*sizeof(float_type)/1024^3
         achieved_bandwidth_GBs = GB/kernel_time_s
         percent_efficiency = achieved_bandwidth_GBs/device_bandwidth_GBs*100
         @info "Bandwidth info" problem_size float_type achieved_bandwidth_GBs device_bandwidth_GBs percent_efficiency
       end
bandwidth_efficiency (generic function with 1 method)

julia> bandwidth_efficiency(; # main (with index swapping)
           float_type=Float64, kernel_time_s=(116.978)/10^6, n_reads_writes=2
       )
┌ Info: Bandwidth info
│   problem_size = (63, 4, 4, 1, 5400)
│   float_type = Float64
│   achieved_bandwidth_GBs = 693.3782472802712
│   device_bandwidth_GBs = 2039
└   percent_efficiency = 34.00579927809079

julia> bandwidth_efficiency(; # this PR (no index swapping)
           float_type=Float64, kernel_time_s=(64.350)/10^6, n_reads_writes=2
       )
┌ Info: Bandwidth info
│   problem_size = (63, 4, 4, 1, 5400)
│   float_type = Float64
│   achieved_bandwidth_GBs = 1260.4506699355334
│   device_bandwidth_GBs = 2039
└   percent_efficiency = 61.817100045881965

From https://developer.download.nvidia.com/GTC/PDF/1083_Wang.pdf, slide 35:

70-80% is very good

So I'd say that this PR puts us in the "good" range for pointwise broadcast expressions 🎉. cc @tapios

Next, we should fix the stencil kernels. That will be trickier since the indexing is split up, and I think some of those kernels may use non-cartesian indexing. But we should see similar speedups there.

For future self: the reason we're faster than ordinary CuArrays is that our datalayouts now have completely static sizes, so offsets can be determined at compile time. For CuArrays, the sizes are not known at compile time.
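A toy illustration of that point (hypothetical names, not ClimaCore code): when the extents are type parameters, the stride arithmetic in getindex can constant-fold at compile time, whereas a runtime-sized array carries its extents as fields and must load them at run time.

# Nv and Ni are type parameters, so they are compile-time constants.
struct StaticSized2DDemo{Nv, Ni, A <: AbstractVector}
    v::A   # flat, column-major storage of an (Nv, Ni) block
end

# `(i - 1) * Nv` compiles to a multiply by a literal constant.
@inline Base.getindex(x::StaticSized2DDemo{Nv, Ni}, v::Int, i::Int) where {Nv, Ni} =
    x.v[(i - 1) * Nv + v]

x = StaticSized2DDemo{3, 4, Vector{Float64}}(rand(12))
@assert x[2, 3] == x.v[(3 - 1) * 3 + 2]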

Review threads (resolved): test/runtests.jl, src/DataLayouts/DataLayouts.jl, src/DataLayouts/cartesian_index.jl, src/DataLayouts/has_uniform_datalayouts.jl
Sbozzolo (Member) left a comment:

LGTM, but I'd add a comment about the need for redefining the broadcast machinery before merging (as you did in the comment).

charleskawczynski force-pushed the ck/extend_data_specific_cart_ind branch 2 times, most recently from 5a4e9c0 to 137ac8b, on August 13, 2024 at 20:27
Commit: Extend data-specific cartesian index
charleskawczynski (Member, Author) commented on Aug 15, 2024

Just to circle back on this PR: it turns out that this solution is pretty effective for Float64, but not so much for Float32 (xref).
