Drop the field dimension + use linear indexing #1928

Open
10 of 12 tasks
charleskawczynski opened this issue Aug 13, 2024 · 0 comments
Labels: enhancement (New feature or request), SDI (Software Development Issue)

charleskawczynski commented Aug 13, 2024

From this very hacked branch:

https://github.com/CliMA/ClimaCore.jl/tree/ck/drop_field_dimension (PR #1929).

Combining linear indexing with a dropped field dimension on ClimaCore broadcasted objects gives good bandwidth efficiency for pointwise kernels. The results below are from the thermo_bench_bw.jl benchmark script:

Main branch (Clima A100):

[ Info: device = ClimaComms.CUDADevice()
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                              │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     TBB.thermo_func_bc!(x, us; nreps=100, bm)      │ 796 microseconds, 877 nanoseconds │ 12.4798 │ 254.462     │ 10             │ 100    │
└────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

[ Info: device = ClimaComms.CUDADevice()
Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌────────────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                              │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├────────────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     TBB.thermo_func_bc!(x, us; nreps=100, bm)      │ 1 millisecond, 43 microseconds    │ 19.0568 │ 388.569     │ 10             │ 100    │
└────────────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Branch with dropped field dimension + linear indexing (Clima A100):

julia> using Revise; include(joinpath("benchmarks", "scripts", "thermo_bench_bw.jl"))
WARNING: replacing module ThermoBenchBandwidth.
[ Info: device = ClimaComms.CUDADevice()
[ Info: Success!
Problem size: (63, 4, 4, 1, 5400), float_type = Float32, device_bandwidth_GBs=2039
┌───────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                         │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├───────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     TBB.thermo_func_bc!(x, us; nreps=100, bm) │ 131 microseconds, 503 nanoseconds │ 75.6249 │ 1541.99     │ 10             │ 100    │
└───────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘

Problem size: (63, 4, 4, 1, 5400), float_type = Float64, device_bandwidth_GBs=2039
┌───────────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬────────────────┬────────┐
│ funcs                                         │ time per call                     │ bw %    │ achieved bw │ n-reads/writes │ n-reps │
├───────────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼────────────────┼────────┤
│     TBB.thermo_func_bc!(x, us; nreps=100, bm) │ 256 microseconds, 379 nanoseconds │ 77.5791 │ 1581.84     │ 10             │ 100    │
└───────────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴────────────────┴────────┘
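
As a sanity check on the tables above, here is a minimal sketch of how the "achieved bw" and "bw %" columns can be reproduced from the problem size, the float type, and the time per call. The formula is an assumption inferred from the reported numbers (bytes moved = points × sizeof(FT) × n-reads/writes, expressed in units of 2^30 bytes per second), not taken from the benchmark script; the helper names `achieved_bw` and `bw_percent` are hypothetical.

```julia
# Values copied from the benchmark logs above.
problem_size = (63, 4, 4, 1, 5400)
n_points = prod(problem_size)      # number of points touched per call
n_reads_writes = 10                # "n-reads/writes" column
device_bandwidth = 2039            # "device_bandwidth_GBs" from the log

# ASSUMPTION: bytes moved per call = points * sizeof(FT) * n-reads/writes,
# and the table reports bandwidth in units of 2^30 bytes per second.
achieved_bw(FT, t_seconds) = n_points * sizeof(FT) * n_reads_writes / t_seconds / 2^30

# ASSUMPTION: "bw %" is achieved bandwidth relative to device_bandwidth.
bw_percent(bw) = 100 * bw / device_bandwidth

bw_main_f32   = achieved_bw(Float32, 796.877e-6)  # main branch, Float32
bw_branch_f32 = achieved_bw(Float32, 131.503e-6)  # new branch, Float32
bw_branch_f64 = achieved_bw(Float64, 256.379e-6)  # new branch, Float64

# Speedup of the new branch over main for Float32, from the times per call.
speedup_f32 = 796.877 / 131.503

println("main   Float32: ", bw_main_f32, " (", bw_percent(bw_main_f32), " %)")
println("branch Float32: ", bw_branch_f32, " (", bw_percent(bw_branch_f32), " %)")
println("branch Float64: ", bw_branch_f64, " (", bw_percent(bw_branch_f64), " %)")
println("Float32 speedup: ", speedup_f32)
```

Under these assumptions the computed values should agree with the table entries to within rounding. The main-branch Float64 row is left out because its time per call is only printed to microsecond precision.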

One future-proofing complication of this branch is that we will still need to support a field dimension (perhaps inside TupleOfArrays, or whatever we decide to call this new layer's struct) so that things continue to work reasonably with on the order of 100 tracers.

Just to note: dropping the field dimension roughly doubled performance, and linear indexing accounted for the rest of the gain. As discussed with @tapios, applying only linear indexing seems to improve performance for broadcasts over a single variable, but seems to degrade performance for broadcasts over multiple variables. So it seems that both changes are needed in tandem to improve performance.
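
To make the two access patterns concrete, here is a small CPU-only sketch using plain `Array`s. It is not ClimaCore's data layout code or the implementation on the branch: `pointwise`, `kernel_with_field_dim!`, and `kernel_linear!` are hypothetical names, the two-field setup is illustrative, and the sizes are a tiny stand-in for the benchmark's problem size.

```julia
# Hypothetical pointwise function standing in for the thermodynamics kernel.
pointwise(a, b) = a * b + one(a)

# Small stand-in sizes; Nf is the field dimension.
Nv, Ni, Nj, Nf, Nh = 4, 2, 2, 2, 3

# Layout A: a single array that keeps the field dimension,
# accessed with Cartesian indices (v, i, j, f, h).
function kernel_with_field_dim!(out, data)
    nv, ni, nj, _, nh = size(data)
    for h in 1:nh, j in 1:nj, i in 1:ni, v in 1:nv
        out[v, i, j, 1, h] = pointwise(data[v, i, j, 1, h], data[v, i, j, 2, h])
    end
    return out
end

# Layout B: the field dimension is dropped, so each field is its own array,
# and the kernel walks all arrays with one shared linear index.
function kernel_linear!(out, fields)
    a, b = fields
    @inbounds for n in eachindex(out, a, b)
        out[n] = pointwise(a[n], b[n])
    end
    return out
end

data = rand(Float32, Nv, Ni, Nj, Nf, Nh)

out_a = zeros(Float32, Nv, Ni, Nj, 1, Nh)
kernel_with_field_dim!(out_a, data)

# Split the fields into a tuple of arrays without a field dimension.
fields = (data[:, :, :, 1, :], data[:, :, :, 2, :])
out_b = zeros(Float32, Nv, Ni, Nj, Nh)
kernel_linear!(out_b, fields)

# Both layouts compute the same values.
@assert out_a[:, :, :, 1, :] == out_b
```

In layout B every array is contiguous per field and the index arithmetic is a single integer increment, which is the combination the benchmark results above are attributed to. Supporting on the order of 100 tracers would still need some form of field dimension, which this sketch does not address.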

cc @tapios

Tasks

  1. refactor
  2. refactor
  3. refactor
  4. refactor
  5. performance
@charleskawczynski charleskawczynski added enhancement New feature or request SDI Software Development Issue labels Aug 13, 2024
@charleskawczynski charleskawczynski self-assigned this Aug 13, 2024
This was referenced Aug 14, 2024