Communication-Avoiding Algorithms minimize the movement of data within a memory hierarchy and between processors in order to improve running time and energy consumption. They minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, i.e. moving data either between levels of memory or between multiple processors over a network, is much more expensive than arithmetic.[1] Communication-Avoiding Algorithms therefore reduce or minimize communication in order to reduce the overall cost of an algorithm.

Motivation

Consider the following running-time model:

  • Measure of computation = Time per FLOP = γ
  • Measure of communication = Time per word of data moved = β

⇒ Total running time = γ*(no. of FLOPs) + β*(no. of words)

Since β >> γ as measured in time and energy, communication cost dominates computation cost. Technological trends[2] indicate that the relative cost of communication is increasing on a variety of platforms, from cloud computing to supercomputers to mobile devices.
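
As a rough illustration of this model, the sketch below evaluates it in Python; the values of γ (gamma) and β (beta) are illustrative placeholders rather than measured constants, and the only assumption that matters is β >> γ.

 # Two-term running-time model: T = gamma*(no. of FLOPs) + beta*(no. of words moved).
 gamma = 1e-10   # time per FLOP (seconds) -- assumed, for illustration only
 beta = 1e-8     # time per word moved (seconds) -- assumed, so that beta >> gamma

 def total_time(flops, words):
     """Total running time under the two-term cost model."""
     return gamma * flops + beta * words

 # An algorithm performing 1e9 FLOPs while moving 1e8 words spends 0.1 s on
 # arithmetic but 1.0 s on communication under these assumed constants.
 print(total_time(flops=1e9, words=1e8))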


[Figure: relative cost of computation vs. communication across platforms. Source: John Shalf]

The 2008 DARPA Exascale report predicts that the gap between DRAM access time and FLOP time will grow 100x over the coming decade in order to balance power usage between processors and DRAM.

Minimizing communication also reduces energy consumption: the energy cost of an access grows by orders of magnitude for levels of the memory hierarchy further from the processor, as shown in fig x.

United States President Obama cited Communication-Avoiding Algorithms in the FY 2012 Department of Energy budget request to Congress:[1]

“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm.[4][5] This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.”

Objectives

Communication-Avoiding algorithms are designed with the following objectives:

  • Reorganize algorithms to reduce communication across all levels of the memory hierarchy.
  • Attain lower bounds on communication when possible.

The following simple example[1] explains how these objectives are achieved.

Classic Matrix Multiplication Example

Let A, B and C be square matrices of order n × n. Consider the multiplication algorithm:

 {implements C = C + A * B}
 for i = 1 to n
     for j = 1 to n
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)

Arithmetic cost (time complexity): n²(2n − 1), or O(n³) for large n.
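
For concreteness, a minimal runnable Python version of the same triple loop (the language is purely illustrative; the pseudocode above is language-agnostic):

 def matmul_classic(A, B, C):
     """C = C + A*B using the classic triple loop; A, B, C are n x n lists of lists."""
     n = len(A)
     for i in range(n):
         for j in range(n):
             for k in range(n):
                 C[i][j] += A[i][k] * B[k][j]
     return C

 # Example: accumulate a 2 x 2 product into a zero matrix.
 A = [[1, 2], [3, 4]]
 B = [[5, 6], [7, 8]]
 C = [[0, 0], [0, 0]]
 print(matmul_classic(A, B, C))   # [[19, 22], [43, 50]]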

Rewriting this algorithm with the communication cost labelled at each step:

 {implements C = C + A*B}
 for i = 1 to n
     {read row i of A into fast memory}               …  n² reads
     for j = 1 to n
         {read C(i,j) into fast memory}               …  n² reads
         {read column j of B into fast memory}        …  n³ reads
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           …  n² writes

Communication cost (reads/writes): n³ + 3n² or O(n³)
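
As a sanity check on this count, the reads and writes of the annotated loop can be tallied for a small n; this sketch only mirrors the per-element accounting above and does not model an actual cache.

 def naive_traffic(n):
     """Count slow-memory reads and writes of the annotated classic algorithm."""
     reads = writes = 0
     for i in range(n):
         reads += n                # read row i of A
         for j in range(n):
             reads += 1            # read C(i,j)
             reads += n            # read column j of B
             # the inner k loop runs entirely in fast memory
             writes += 1           # write C(i,j) back to slow memory
     return reads + writes

 n = 10
 print(naive_traffic(n), n**3 + 3 * n**2)   # both print 1300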

Since the total running time is γ*O(n³) + β*O(n³) and β >> γ, the communication cost is dominant. The following communication-avoiding algorithm improves on this bottleneck.

Blocked (Tiled) Matrix Multiplication

We first define the notions of “fast” and “slow” memory for this example: fast memory is the local processor memory (cache) of size M, and slow memory is DRAM. Consider A, B and C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume that three b-by-b blocks fit in fast memory.

 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}          … b² × (n/b)² = n² reads
         for k = 1 to n/b
             {read block A(i,k) into fast memory}      … b² × (n/b)³ = n³/b reads
             {read block B(k,j) into fast memory}      … b² × (n/b)³ = n³/b reads
             C(i,j) = C(i,j) + A(i,k) * B(k,j)         {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}      … b² × (n/b)² = n² writes

Communication cost: 2n³/b + 2n² reads/writes << 2n³ arithmetic cost
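
A minimal sketch of the blocked variant in Python, assuming the block size b divides n evenly and doing the block multiply with the same triple loop:

 def matmul_blocked(A, B, C, b):
     """C = C + A*B computed on b x b blocks; assumes b divides n."""
     n = len(A)
     for i in range(0, n, b):
         for j in range(0, n, b):
             # block C(i,j) stays in fast memory for the entire k loop
             for k in range(0, n, b):
                 for ii in range(i, i + b):
                     for jj in range(j, j + b):
                         for kk in range(k, k + b):
                             C[ii][jj] += A[ii][kk] * B[kk][jj]
     return C

 # Same 2 x 2 example as before; with b = 1 this reduces to the classic loop.
 A = [[1, 2], [3, 4]]
 B = [[5, 6], [7, 8]]
 C = [[0, 0], [0, 0]]
 print(matmul_blocked(A, B, C, 1))   # [[19, 22], [43, 50]]

In a real implementation the three innermost loops would be a call to an optimized kernel; the point of the sketch is only the loop reordering that keeps each C block resident while the corresponding A and B blocks stream through fast memory.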

Making b as large as possible, 3b² ≤ M, so the communication lower bound is √3·n³/√M + 2n², or Ω(no. of FLOPs / √M).
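
A small numeric illustration of this choice; the fast-memory capacity M below is a hypothetical value picked for the example, not a measured cache size.

 import math

 M = 3 * 64**2                      # assumed fast memory: room for three 64 x 64 blocks
 n = 4096

 b = int(math.sqrt(M / 3))          # largest block size with 3*b**2 <= M
 classic_words = n**3 + 3 * n**2            # communication of the classic algorithm
 blocked_words = 2 * n**3 // b + 2 * n**2   # communication of the blocked algorithm

 print(f"b = {b}")
 print(f"classic: {classic_words:.2e} words moved")
 print(f"blocked: {blocked_words:.2e} words moved")   # roughly b/2 = 32 times fewer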

References

[1] Demmel, SC Tutorial.
[2] 2008 DARPA Exascale report.
[3] CAA and Fast Matrix Multiplication.
[4] CA-GMRES (Hoemmen).
[5] Tall-Skinny QR (Grigori).