
The Chronicles of Phi - part 4 - Hyper-Thread Phalanx – tiled_HT2

The prior part (3) of this blog showed the effects of the first-level implementation of the Hyper-Thread Phalanx. That change yielded a 9.7% performance improvement for the small model, and little to no improvement for the large model. This left part 3 with two questions:

What is non-optimal about this strategy?
And: What can be improved?

There are two things: one is obvious, and the other is not so obvious.

Data alignment

The obvious thing, which I will now show you, is that vectorization improves with aligned data. When vectorizing a loop, the compiler will examine its code and, when necessary, insert preamble code that tests for alignment and executes scalar iterations until an aligned boundary is reached (this is called peeling), then emit a main body that executes more efficiently on the now-aligned data. Finally, post-amble code is inserted to handle any remainder that may be present.
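
Conceptually, the generated code for a simple loop has that three-part shape. The following is a minimal hand-written sketch of it, not actual compiler output; the names scaled_copy, VEC_BYTES and VEC_REALS are mine, chosen for illustration, and REAL is taken as float here.

/* Illustrative sketch only: shows the peel / main body / remainder
   structure the vectorizer produces, not actual compiler output. */
#include <stdint.h>
#include <stddef.h>

typedef float REAL;                       /* float for illustration */
#define VEC_BYTES 64                      /* vector register width on Xeon Phi */
#define VEC_REALS (VEC_BYTES / sizeof(REAL))

void scaled_copy(REAL *dst, const REAL *src, size_t n, REAL scale)
{
  size_t i = 0;

  /* Peel (preamble): scalar iterations until &dst[i] reaches a
     vector-aligned boundary. */
  while (i < n && ((uintptr_t)&dst[i] % VEC_BYTES) != 0) {
    dst[i] = scale * src[i];
    ++i;
  }

  /* Main body: full-width chunks over the now-aligned data; this is
     where the compiler emits aligned vector loads and stores. */
  for (; i + VEC_REALS <= n; i += VEC_REALS) {
    for (size_t j = 0; j < VEC_REALS; ++j)   /* one vector's worth */
      dst[i + j] = scale * src[i + j];
  }

  /* Post-amble: scalar remainder. */
  for (; i < n; ++i)
    dst[i] = scale * src[i];
}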

This sounds rather straightforward until you look at the inner loop:

#pragma simd   
          for (x = 1; x < nx-1; x++) { 
            ++c; 
            ++n; 
            ++s; 
            ++b; 
            ++t; 
            f2_t[c] = cc * f1_t[c] + cw * f1_t[c-1] + ce * f1_t[c+1] 
                + cs * f1_t[s] + cn * f1_t[n] + cb * f1_t[b] + ct * f1_t[t]; 
          } 

In the preceding code, the two terms f1_t[c-1] and f1_t[c+1] will muck up the vector alignment tests, since [c-1], [c] and [c+1] can never all be aligned at the same time (with 4-byte floats, if &f1_t[c] sits on a 64-byte boundary, then &f1_t[c-1] and &f1_t[c+1] are offset by ±4 bytes from it).

Are the compiler writers smart enough to offer some measure of optimization for such a loop?

As it turns out, they are able to offer some measure of optimization.

Because the alignments are unknown at compile time, the generated code has to do more work in the preamble and post-amble sections, and fewer iterations execute in the fastest interior body of the loop.

Take particular note that the input array f1_t is indexed in seven different ways. This means that the preamble code that determines alignment may have to work through minor permutations of the seven references in an attempt to home in on the point at which the largest number of references is vector aligned. This is non-trivial for the compiler's code generation, as well as a potential source of additional overhead.

What can be improved?

Use aligned data when possible

This is addressed in an additional improvement to the coding of the tiled_HT2 program.

First, we require that the dimension NX be a multiple of the number of REALs that fill a cache line. This is not an unreasonable requirement: the value 256 was used in the original example code, and it is not too much of a restriction to require that NX be a multiple of 16 for floats, or 8 for doubles.
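
For reference, here is a minimal sketch of how the cache-line constants and the NX requirement might be expressed. The macro names match those used in the allocation code shown later, but these particular definitions and the assert are my own illustration, not the program's actual source:

#include <assert.h>

// A sketch, not the original source; the names match those used in
// the allocation code later in this part.
#define CACHE_LINE_SIZE        64
#define N_REALS_PER_CACHE_LINE (CACHE_LINE_SIZE / sizeof(REAL))

// In the setup code, before allocating: NX must be a whole number of
// cache lines (16 REALs per line for float, 8 for double).
assert(nx % N_REALS_PER_CACHE_LINE == 0);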

To assure alignment, I changed the array allocations from malloc to _mm_malloc with an alignment of the cache line size (64 bytes). This is a relatively simple change. (It will be shown later, after the next optimization tip, which also affects allocation.)

Next, now that I know that NX is an even multiple of cache lines and the arrays are cache-line aligned, I can construct a function to process the innermost loop with the foreknowledge that six of the eight array references are cache aligned and two are not (the eighth reference is the output array). The two that are not aligned are the references to [c-1] and [c+1]. The compiler, knowing beforehand what is aligned and what is not, does not have to insert code to make this determination; i.e., it can reduce or completely remove the preamble and post-amble code.

The second (non-obvious) improvement:

Redundancy can be good for you

Additional optimization can be achieved by redundantly processing x=0 and x=nx-1 as if these cells were in the interior of the parallelepiped being processed. This means that the preamble and post-amble code for unaligned loops can be bypassed: the elements x=0:15 can be processed as one aligned vector, computing the interior elements x=1:15 correctly in the process (as opposed to one-by-one computation or unaligned vector computation). The same is done for the last 16 elements, where the final element (x=nx-1) computes differently from the others in the vector. This does mean that after calculating the incorrect values (for free) for x=0 and x=nx-1, we then have to perform a scalar calculation to insert the correct values into those x columns. Essentially, you are exchanging two scalar loops of 16 iterations each for two instances of (one 16-wide vector operation + one scalar operation), where the scalar operations hit data already in L1 cache.

Adding the redundancy change necessitated allocating the arrays two vectors' worth of elements larger than the actual array requirement, and returning the address of the 2nd vector as the array pointer. Additionally, this requires zeroing one element preceding and one element following the working array. The padding then provides one vector of addressable memory on each side of the array (not used as valid data). Not doing so could result in a page fault, depending on the location and extent of the allocation.

Change to allocations:

  // align the allocations to cache line
  // increase allocation size by 2 cache lines
  REAL *f1_padded = (REAL *)_mm_malloc(
    sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
    CACHE_LINE_SIZE);

  // assure allocation succeeded
  assert(f1_padded != NULL);
  
  // advance one cache line into buffer
  REAL *f1 = f1_padded + N_REALS_PER_CACHE_LINE;
  
  f1[-1] = 0.0;       // assure cell prior to array not Signaling NaN
  f1[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN

  // align the allocations to cache line
  // increase allocation size by 2 cache lines
  REAL *f2_padded = (REAL *)_mm_malloc(
    sizeof(REAL)*(nx*ny*nz + N_REALS_PER_CACHE_LINE*2),
    CACHE_LINE_SIZE);

  // assure allocation succeeded
  assert(f2_padded != NULL);
  
  // advance one cache line into buffer
  REAL *f2 = f2_padded + N_REALS_PER_CACHE_LINE;
  
  f2[-1] = 0.0;       // assure cell prior to array not Signaling NaN
  f2[nx*ny*nz] = 0.0; // assure cell following array not Signaling NaN
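
One detail not shown in the excerpt above: buffers obtained with _mm_malloc must be released with _mm_free, and it is the padded base pointers, not the advanced f1/f2 pointers, that must be handed back:

  // Cleanup (not part of the original listing): _mm_malloc'ed memory
  // must be freed with _mm_free, and the pointer passed must be the
  // one _mm_malloc returned (f1_padded/f2_padded), not f1/f2.
  _mm_free(f1_padded);
  _mm_free(f2_padded);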

As an additional benefit, the compiler can now generate more code using Fused Multiply-Add (FMA) instructions.

The tiled_HT2 code follows:

void diffusion_tiled_aligned(
                REAL*restrict f2_t_c, // aligned
                REAL*restrict f1_t_c, // aligned
                REAL*restrict f1_t_w, // not aligned
                REAL*restrict f1_t_e, // not aligned
                REAL*restrict f1_t_s, // aligned
                REAL*restrict f1_t_n, // aligned
                REAL*restrict f1_t_b, // aligned
                REAL*restrict f1_t_t, // aligned
                REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
                REAL cb, REAL cc, int countX, int countY) {

  __assume_aligned(f2_t_c, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_c, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_s, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_n, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_b, CACHE_LINE_SIZE);
  __assume_aligned(f1_t_t, CACHE_LINE_SIZE);
  // countY is number of squads along Y axis
  for(int iY = 0; iY < countY; ++iY) {
    // perform the x=0:N_REALS_PER_CACHE_LINE-1 as one cache line operation
    // On Phi, the following reduces to vector with one iteration
    // On AVX two iterations
    // On SSE four iterations
    #pragma noprefetch
    #pragma simd  
    for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++) {
      f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                   + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    } // for (int i = 0; i < N_REALS_PER_CACHE_LINE; i++)
    
    // now overstrike x=0 with correct value
    // x=0 special (no f1_t[c-1])
    f2_t_c[0] = cc * f1_t_c[0] + cw * f1_t_w[1] + ce * f1_t_e[0]
                + cs * f1_t_s[0] + cn * f1_t_n[0] + cb * f1_t_b[0] + ct * f1_t_t[0];
    // Note: while we could overstrike x=0 and x=nx-1 after processing the entire x run (countX),
    // by then the x=0 cache line would already have been evicted from L1 cache.

    // do remainder of countX run including incorrect value for i=nx-1 (countX-1)
    #pragma vector nontemporal
    #pragma noprefetch
    #pragma simd  
    for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++) {
        f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i]
                 + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];
    } // for (int i = N_REALS_PER_CACHE_LINE; i < countX; i++)

    // now overstrike x=nx-1 with correct value
    // x=nx-1 special (no f1_t[c+1])
    int i = countX-1;
    f2_t_c[i] = cc * f1_t_c[i] + cw * f1_t_w[i] + ce * f1_t_e[i-1]
                   + cs * f1_t_s[i] + cn * f1_t_n[i] + cb * f1_t_b[i] + ct * f1_t_t[i];

    // advance one step along Y
    f2_t_c += countX;
    f1_t_c += countX;
    f1_t_w += countX;
    f1_t_e += countX;
    f1_t_s += countX;
    f1_t_n += countX;
    f1_t_b += countX;
    f1_t_t += countX;
  } // for(int iY = 0; iY < countY; ++iY)
} // void diffusion_tiled_aligned(

void diffusion_tiled(REAL *restrict f1, REAL *restrict f2, int nx, int ny, int nz,
              REAL ce, REAL cw, REAL cn, REAL cs, REAL ct,
              REAL cb, REAL cc, REAL dt, int count) {

#pragma omp parallel
  {

    REAL *f1_t = f1;
    REAL *f2_t = f2;

    int nSquadsZ = (nz + nHTs - 1) / nHTs; // place squads across z dimension
    int nSquadsZY = nSquadsZ * ny;  // number of full (and partial) squads on z-y face
    int nSquadsZYPerCore = (nSquadsZY + nCores - 1) / nCores;

    // Determine this thread's squads
    int SquadBegin = nSquadsZYPerCore * myCore;
    int SquadEnd = SquadBegin + nSquadsZYPerCore; // 1 after last squad for core
    if(SquadEnd > nSquadsZY) SquadEnd = nSquadsZY;
    for (int i = 0; i < count; ++i) {
      int nSquads;
      // restrict current thread to its subset of squads on the Z/Y face.
      for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads) {
        // determine nSquads for this pass
        if(iSquad % ny == 0)
          nSquads = 1; // at y==0 boundary
        else
        if(iSquad % ny == ny - 1)
          nSquads = 1;  // at y==ny-1 boundary
        else
        if(iSquad / ny == (SquadEnd - 1) / ny)
          nSquads = SquadEnd - iSquad;  // within (inclusive) 1:ny-1
        else
          nSquads = ny - (iSquad % ny) - 1; // restrict from iSquad%ny to ny-1
        int z0 = (iSquad / ny) * nHTs; // home z for 0'th team member of Squad
        int z = z0 + myHT;  // z for this team member
        int y = iSquad % ny;
        // last squad along z may be partially filled
        // assure we are within z
        if(z < nz)
        {
          int x = 0;
          int c, n, s, b, t;
          c =  x + y * nx + z * nx * ny;
          n = (y == 0)    ? c : c - nx;
          s = (y == ny-1) ? c : c + nx;
          b = (z == 0)    ? c : c - nx * ny;
          t = (z == nz-1) ? c : c + nx * ny;
          diffusion_tiled_aligned(
            &f2_t[c],   // aligned
            &f1_t[c],   // aligned
            &f1_t[c-1], // unaligned
            &f1_t[c+1], // unaligned
            &f1_t[s],   // aligned
            &f1_t[n],   // aligned
            &f1_t[b],   // aligned
            &f1_t[t],   // aligned
            ce, cw, cn, cs, ct, cb, cc, nx, nSquads);
        } // if(z < nz)
      } // for(int iSquad = SquadBegin; iSquad < SquadEnd; iSquad += nSquads)
// barrier required because we removed implicit barrier of #pragma omp for collapse(2)
      #pragma omp barrier
      // swap buffer pointers
      REAL *t = f1_t;
      f1_t = f2_t;
      f2_t = t;
    } // count
  } // parallel
  return;
}
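
The driver above relies on the thread-topology values established in part 3 of this series (nCores, nHTs, myCore, myHT). As a reminder of the idea only, and assuming compact thread pinning (e.g. KMP_AFFINITY=compact) with nHTs hardware threads per core, they could be derived roughly as in the sketch below; this is my reconstruction for illustration, and the code in part 3 determines the topology more carefully:

    // Inside the #pragma omp parallel region (a reconstruction, not
    // the code from part 3). Assumes compact pinning and that nHTs
    // evenly divides the OpenMP thread count.
    int nHTs     = 4;                      // e.g. 4 hardware threads per core on Xeon Phi
    int nThreads = omp_get_num_threads();  // team size
    int nCores   = nThreads / nHTs;        // cores in use
    int myID     = omp_get_thread_num();
    int myCore   = myID / nHTs;            // which core this thread runs on
    int myHT     = myID % nHTs;            // which HT within that core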

The performance chart below incorporates the two new programs tiled_HT and tiled_HT2.

The above chart clearly illustrates that tiled_HT2 is starting to make some real progress, at least for the small model, with another 9.5% improvement. Be mindful that code alignment may still be an issue, and the above chart does not take this into consideration.

What else can be improved?

Think about it while you await part 5.

Jim Dempsey
Consultant
QuickThread Programming, LLC

