『高性能计算』BLISLAB代码笔记

项目仓库地址：ybNo1/blislab: BLISlab: A Sandbox for Optimizing GEMM

首先需要注意两点：

blislab 的代码是列主序的。
引入了 主元的跨度 （leading dimension，ld）这个概念。C(i, j) 的含义是 C[j * ldc + i]，如下图所示：

以列主元为例，说明跳到下一列的时候不应该 + m，而是 + ldc。

step1：基础优化

从数学原理的角度观察，跟最朴素的矩阵乘没区别。

/*
 * --------------------------------------------------------------------------
 * BLISLAB 
 * --------------------------------------------------------------------------
 * Copyright (C) 2016, The University of Texas at Austin
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *  - Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  - Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  - Neither the name of The University of Texas nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 *
 * bl_dgemm.c
 *
 *
 * Purpose:
 * this is the main file of blislab dgemm.
 *
 * Todo:
 *
 *
 * Modification:
 *
 * 
 * */
 

#include "bl_dgemm.h"

void AddDot( int k, double *A, int lda, double *B, int ldb, double *result ) {
  int p;
  for ( p = 0; p < k; p++ ) {
    *result += A( 0, p ) * B( p, 0 );
  }
}


void AddDot_MRxNR( int k, double *A, int lda, double *B, int ldb, double *C, int ldc )
{
  int ir, jr;
  int p;
  for ( jr = 0; jr < DGEMM_NR; jr++ ) {
    for ( ir = 0; ir < DGEMM_MR; ir++ ) {

      AddDot( k, &A( ir, 0 ), lda, &B( 0, jr ), ldb, &C( ir, jr ) );

    }
  }
}

void bl_dgemm(
    int    m,
    int    n,
    int    k,
    double *A,
    int    lda,
    double *B,
    int    ldb,
    double *C,        // must be aligned
    int    ldc        // ldc must also be aligned
)
{
    int    i, j, p;
    int    ir, jr;

    // Early return if possible
    if ( m == 0 || n == 0 || k == 0 ) {
        printf( "bl_dgemm(): early return\n" );
        return;
    }

    for ( j = 0; j < n; j += DGEMM_NR ) {          // Start 2-nd loop
        for ( i = 0; i < m; i += DGEMM_MR ) {      // Start 1-st loop

            AddDot_MRxNR( k, &A( i, 0 ), lda, &B( 0, j ), ldb, &C( i, j ), ldc );

        }                                          // End   1-st loop
    }                                              // End   2-nd loop

}

通过阅读源码，可知采用了如下的基础优化技术：

使用指针取代对地址的直接计算

使用指针前：

for ( i = 0; i < m; i ++ ) {
	for ( j = 0; j < n; j ++ ) {
		C( i, j ) = 0.0;
	}
}

使用指针后：

double *cp;

for ( j = 0; j < n; j ++ ) {
	cp = &C[ j * ldc ]; // point cp to top of jth column
	for ( i = 0; i < m; i ++ ) {
		*cp++ = 0.0; // set the element that cp points to to zero and
		// advance the pointer.
	}
}

循环展开

本质上是在朴素矩阵乘法的基础上做一个基础的分块，其中，MR 和 NR 约束了分块的规模，缺省时是 1 x 1 ，也就是没有分块。

可以看到，AddDot 函数中看似冗杂的运算其实是重复利用读取的 a 和 b，把中间结果保存在寄存器里。

但这一层次的复用只是针对寄存器做的分块（也就是复用读取到寄存器的数据），矩阵规模变大了之后，子矩阵规模可能也超过了 cache 的大小，导致很严重的 cache miss 。

step2：针对多级 cache 的分块

基于 step1 寄存器分块的缺点，提出对分块优化的更高要求：操纵 cache（也就是 cache 分块）。如果想要达到这个目标，就需要讨论：

哪些数据放到 cache 中
将 cache 中的哪些数据替换出去

但实际上我们不能像“操纵”寄存器那样 —— 通过写一些限定代码让这一段数据在寄存器中被操作 —— 来操纵 cache ，本质是 CPU 不能被直接操纵。因此我们做一些 tradeoff ，间接完成类似的效果：

使用 pack 函数（打包函数）
- 含义是将不连续的数据 copy 到连续的内存中，之后从连续的内存读取数据，进而利用硬件上采用的一些替换算法提高分块的访存性能，减小 cache miss
- 需要额外申请一块缓冲区
- 使用 Z 字型做排列（至于为什么是 Z 字型，跟条带分割的算法有关）
某些芯片可以使用 预取指令 （prefetch），让一些数据提前被读到 cache 里

此处还不得不考虑一个问题：cache 的读取是以 cache line 为单位的，也就是说存在对齐的需求。在对齐的情况下更能发挥内存带宽等处理器性能。这一步可以通过调整申请的 buffer 的规模来控制。

/*
 * --------------------------------------------------------------------------
 * BLISLAB 
 * --------------------------------------------------------------------------
 * Copyright (C) 2016, The University of Texas at Austin
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *  - Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  - Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  - Neither the name of The University of Texas nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 *
 * bl_dgemm.c
 *
 *
 * Purpose:
 * this is the main file of blislab dgemm.
 *
 * Todo:
 *
 *
 * Modification:
 *
 * 
 * */

#include <stdio.h>

#include "bl_dgemm_kernel.h"
#include "bl_dgemm.h"

inline void packA_mcxkc_d(
        int    m,
        int    k,
        double *XA,
        int    ldXA,
        int    offseta,
        double *packA
        )
{
    int    i, p;
    double *a_pntr[ DGEMM_MR ];

    for ( i = 0; i < m; i ++ ) {
        a_pntr[ i ] = XA + ( offseta + i );
    }

    for ( i = m; i < DGEMM_MR; i ++ ) {
        a_pntr[ i ] = XA + ( offseta + 0 );
    }

    for ( p = 0; p < k; p ++ ) {
        for ( i = 0; i < DGEMM_MR; i ++ ) {
            *packA = *a_pntr[ i ];
            packA ++;
            a_pntr[ i ] = a_pntr[ i ] + ldXA;
        }
    }
}


/*
 * --------------------------------------------------------------------------
 */

inline void packB_kcxnc_d(
        int    n,
        int    k,
        double *XB,
        int    ldXB, // ldXB is the original k
        int    offsetb,
        double *packB
        )
{
    int    j, p; 
    double *b_pntr[ DGEMM_NR ];

    for ( j = 0; j < n; j ++ ) {
        b_pntr[ j ] = XB + ldXB * ( offsetb + j );
    }

    for ( j = n; j < DGEMM_NR; j ++ ) {
        b_pntr[ j ] = XB + ldXB * ( offsetb + 0 );
    }

    for ( p = 0; p < k; p ++ ) {
        for ( j = 0; j < DGEMM_NR; j ++ ) {
            *packB ++ = *b_pntr[ j ] ++;
        }
    }
}

/*
 * --------------------------------------------------------------------------
 */
void bl_macro_kernel(
        int    m,
        int    n,
        int    k,
        double *packA,
        double *packB,
        double *C,
        int    ldc
        )
{
    int bl_ic_nt;
    int    i, ii, j;
    aux_t  aux;
    char *str;

    aux.b_next = packB;

    for ( j = 0; j < n; j += DGEMM_NR ) {                        // 2-th loop around micro-kernel
        aux.n  = min( n - j, DGEMM_NR );
        for ( i = 0; i < m; i += DGEMM_MR ) {                    // 1-th loop around micro-kernel
            aux.m = min( m - i, DGEMM_MR );
            if ( i + DGEMM_MR >= m ) {
                aux.b_next += DGEMM_NR * k;
            }

            ( *bl_micro_kernel ) (
                    k,
                    &packA[ i * k ],
                    &packB[ j * k ],
                    &C[ j * ldc + i ],
                    (unsigned long long) ldc,
                    &aux
                    );
        }                                                        // 1-th loop around micro-kernel
    }                                                            // 2-th loop around micro-kernel
}

// C must be aligned
void bl_dgemm(
        int    m,
        int    n,
        int    k,
        double *XA,
        int    lda,
        double *XB,
        int    ldb,
        double *C,        // must be aligned
        int    ldc        // ldc must also be aligned
        )
{
    int    i, j, p, bl_ic_nt;
    int    ic, ib, jc, jb, pc, pb;
    int    ir, jr;
    double *packA, *packB;
    char   *str;

    // Early return if possible
    if ( m == 0 || n == 0 || k == 0 ) {
        printf( "bl_dgemm(): early return\n" );
        return;
    }

    // sequential is the default situation
    bl_ic_nt = 1;
    // check the environment variable
    str = getenv( "BLISLAB_IC_NT" );
    if ( str != NULL ) {
        bl_ic_nt = (int)strtol( str, NULL, 10 );
    }

    // Allocate packing buffers
    packA  = bl_malloc_aligned( DGEMM_KC, ( DGEMM_MC + 1 ) * bl_ic_nt, sizeof(double) );
    packB  = bl_malloc_aligned( DGEMM_KC, ( DGEMM_NC + 1 )           , sizeof(double) );

    for ( jc = 0; jc < n; jc += DGEMM_NC ) {                                       // 5-th loop around micro-kernel
        jb = min( n - jc, DGEMM_NC );
        for ( pc = 0; pc < k; pc += DGEMM_KC ) {                                   // 4-th loop around micro-kernel
            pb = min( k - pc, DGEMM_KC );

            for ( j = 0; j < jb; j += DGEMM_NR ) {
                packB_kcxnc_d(
                        min( jb - j, DGEMM_NR ),
                        pb,
                        &XB[ pc ],
                        k, // should be ldXB instead
                        jc + j,
                        &packB[ j * pb ]
                        );
            }


            for ( ic = 0; ic < m; ic += DGEMM_MC ) {                               // 3-rd loop around micro-kernel

                ib = min( m - ic, DGEMM_MC );

                for ( i = 0; i < ib; i += DGEMM_MR ) {
                    packA_mcxkc_d(
                            min( ib - i, DGEMM_MR ),
                            pb,
                            &XA[ pc * lda ],
                            m,
                            ic + i,
                            &packA[ 0 * DGEMM_MC * pb + i * pb ]
                            );
                }

                bl_macro_kernel(
                        ib,
                        jb,
                        pb,
                        packA  + 0 * DGEMM_MC * pb,
                        packB,
                        &C[ jc * ldc + ic ], 
                        ldc
                        );
            }                                                                     // End 3.rd loop around micro-kernel
        }                                                                         // End 4.th loop around micro-kernel
    }                                                                             // End 5.th loop around micro-kernel

    free( packA );
    free( packB );
}

BlisLab 的 tutorial 中用了一张比较吓人的图说明 GotoBLAS 中针对三级 cache 使用的分块算法（以及排流水的 Piping 过程）：

不过，如果使用 GPU 的话，程序员可以通过软件操控 cache ，这里就可以使用很多数据搬运的 tricks，例如从 global memory 搬运到 shared memory，从 shared memory 搬运到寄存器等。只不过这种手动控制的编程复杂度很高。

一个常用的 GPU 编程优化点叫做 双缓冲优化 ，其一般流程是：

Global memory → shared memory → 计算 → 写回 memory

双缓冲的过程体现在将 shared memory 一分为二，一半用于做本次运算，一半用于做下一次传输，DMA 的数据传输与内核的运算过程重叠，以平滑多级存储结构的数据传输，让内核始终以峰值速度运行 —— 跟 prefetch 的作用基本上一样。

step3：并行化

基本上就是基于 SIMD 的向量化编程工作，需要讨论中间结果在寄存器中如何排布 / 如何使用合适的汇编指令计算等的问题。了解了对应体系结构的汇编指令细节之后就没什么问题。

例如，光两个 4 x 1 向量相乘得到结果矩阵的 “一层” 的问题，就有两种汇编指令解决方案：直接 Broadcast 模拟向量乘法，或者使用蝶式排列（Butterfly Permutation），如下所示。

P.S. 体系结构中有许多种互连函数，包括蝶式函数、均匀洗牌函数、反位序函数、PM2I 等。

参考：

N. H. F. Beebe, et al. “High-Performance Matrix Multiplication,” Feb. 1970.

张先轶, et al. “OpenBLAS:龙芯3A CPU 的高性能BLAS 库.”

BLISlab: A Sandbox for Optimizing GEMM, https://github.com/ybNo1/blislab/tree/master

Author:
Gwok Hiujin
Copyright:
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Link:
https://gwokhiujin.github.io/2023/06/26/『高性能计算』BLISLAB代码笔记/
Published:
June 26, 2023
Updated:
June 26, 2023

Buy me a cup of coffee ☕.