Gwok HiujinGwok Hiujin

The Bird of Hermes is my name, eating my wings to make me tame.

Jun 26, 2023高性能计算2742 words in 14 min


『高性能计算』BLISLAB代码笔记

项目仓库地址:ybNo1/blislab: BLISlab: A Sandbox for Optimizing GEMM


首先需要注意两点:

  • blislab 的代码是列主序的。

  • 引入了 主元的跨度 (leading dimension,ld)这个概念。C(i, j) 的含义是 C[j * ldc + i],如下图所示:

    pCUhJJA.png

    以列主元为例,说明跳到下一列的时候不应该 + m,而是 + ldc。

step1:基础优化

从数学原理的角度观察,跟最朴素的矩阵乘没区别。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
/*
* --------------------------------------------------------------------------
* BLISLAB
* --------------------------------------------------------------------------
* Copyright (C) 2016, The University of Texas at Austin
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
* met:
* - Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* - Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* - Neither the name of The University of Texas nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
* HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*
* bl_dgemm.c
*
*
* Purpose:
* this is the main file of blislab dgemm.
*
* Todo:
*
*
* Modification:
*
*
* */


#include "bl_dgemm.h"

void AddDot( int k, double *A, int lda, double *B, int ldb, double *result ) {
int p;
for ( p = 0; p < k; p++ ) {
*result += A( 0, p ) * B( p, 0 );
}
}


void AddDot_MRxNR( int k, double *A, int lda, double *B, int ldb, double *C, int ldc )
{
int ir, jr;
int p;
for ( jr = 0; jr < DGEMM_NR; jr++ ) {
for ( ir = 0; ir < DGEMM_MR; ir++ ) {

AddDot( k, &A( ir, 0 ), lda, &B( 0, jr ), ldb, &C( ir, jr ) );

}
}
}

void bl_dgemm(
int m,
int n,
int k,
double *A,
int lda,
double *B,
int ldb,
double *C, // must be aligned
int ldc // ldc must also be aligned
)
{
int i, j, p;
int ir, jr;

// Early return if possible
if ( m == 0 || n == 0 || k == 0 ) {
printf( "bl_dgemm(): early return\n" );
return;
}

for ( j = 0; j < n; j += DGEMM_NR ) { // Start 2-nd loop
for ( i = 0; i < m; i += DGEMM_MR ) { // Start 1-st loop

AddDot_MRxNR( k, &A( i, 0 ), lda, &B( 0, j ), ldb, &C( i, j ), ldc );

} // End 1-st loop
} // End 2-nd loop

}

通过阅读源码,可知采用了如下的基础优化技术:

使用指针取代对地址的直接计算

使用指针前:

1
2
3
4
5
for ( i = 0; i < m; i ++ ) {
for ( j = 0; j < n; j ++ ) {
C( i, j ) = 0.0;
}
}

使用指针后:

1
2
3
4
5
6
7
8
9
double *cp;

for ( j = 0; j < n; j ++ ) {
cp = &C[ j * ldc ]; // point cp to top of jth column
for ( i = 0; i < m; i ++ ) {
*cp++ = 0.0; // set the element that cp points to to zero and
// advance the pointer.
}
}

循环展开

本质上是在朴素矩阵乘法的基础上做一个基础的分块,其中,MR 和 NR 约束了分块的规模,缺省时是 1 x 1 ,也就是没有分块。

可以看到,AddDot 函数中看似冗杂的运算其实是重复利用读取的 a 和 b,把中间结果保存在寄存器里。

但这一层次的复用只是针对寄存器做的分块(也就是复用读取到寄存器的数据),矩阵规模变大了之后,子矩阵规模可能也超过了 cache 的大小,导致很严重的 cache miss 。

step2:针对多级 cache 的分块

pCUhGid.png

基于 step1 寄存器分块的缺点,提出对分块优化的更高要求:操纵 cache(也就是 cache 分块)。如果想要达到这个目标,就需要讨论:

  • 哪些数据放到 cache 中
  • 将 cache 中的哪些数据替换出去

但实际上我们不能像“操纵”寄存器那样 —— 通过写一些限定代码让这一段数据在寄存器中被操作 —— 来操纵 cache ,本质是 CPU 不能被直接操纵。因此我们做一些 tradeoff ,间接完成类似的效果:

  • 使用 pack 函数(打包函数)
    • 含义是将不连续的数据 copy 到连续的内存中,之后从连续的内存读取数据,进而利用硬件上采用的一些替换算法提高分块的访存性能,减小 cache miss
    • 需要额外申请一块缓冲区
    • 使用 Z 字型做排列(至于为什么是 Z 字型,跟条带分割的算法有关)
      • pCUh1de.png
  • 某些芯片可以使用 预取指令 (prefetch),让一些数据提前被读到 cache 里

pCUhtzt.jpg

此处还不得不考虑一个问题:cache 的读取是以 cache line 为单位的,也就是说存在 对齐 的需求。在对齐的情况下更能发挥内存带宽等处理器性能。这一步可以通过调整申请的 buffer 的规模来控制。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
/*
* --------------------------------------------------------------------------
* BLISLAB
* --------------------------------------------------------------------------
* Copyright (C) 2016, The University of Texas at Austin
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
* met:
* - Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* - Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* - Neither the name of The University of Texas nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
* HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
*
* bl_dgemm.c
*
*
* Purpose:
* this is the main file of blislab dgemm.
*
* Todo:
*
*
* Modification:
*
*
* */

#include <stdio.h>

#include "bl_dgemm_kernel.h"
#include "bl_dgemm.h"

inline void packA_mcxkc_d(
int m,
int k,
double *XA,
int ldXA,
int offseta,
double *packA
)
{
int i, p;
double *a_pntr[ DGEMM_MR ];

for ( i = 0; i < m; i ++ ) {
a_pntr[ i ] = XA + ( offseta + i );
}

for ( i = m; i < DGEMM_MR; i ++ ) {
a_pntr[ i ] = XA + ( offseta + 0 );
}

for ( p = 0; p < k; p ++ ) {
for ( i = 0; i < DGEMM_MR; i ++ ) {
*packA = *a_pntr[ i ];
packA ++;
a_pntr[ i ] = a_pntr[ i ] + ldXA;
}
}
}


/*
* --------------------------------------------------------------------------
*/

inline void packB_kcxnc_d(
int n,
int k,
double *XB,
int ldXB, // ldXB is the original k
int offsetb,
double *packB
)
{
int j, p;
double *b_pntr[ DGEMM_NR ];

for ( j = 0; j < n; j ++ ) {
b_pntr[ j ] = XB + ldXB * ( offsetb + j );
}

for ( j = n; j < DGEMM_NR; j ++ ) {
b_pntr[ j ] = XB + ldXB * ( offsetb + 0 );
}

for ( p = 0; p < k; p ++ ) {
for ( j = 0; j < DGEMM_NR; j ++ ) {
*packB ++ = *b_pntr[ j ] ++;
}
}
}

/*
* --------------------------------------------------------------------------
*/
void bl_macro_kernel(
int m,
int n,
int k,
double *packA,
double *packB,
double *C,
int ldc
)
{
int bl_ic_nt;
int i, ii, j;
aux_t aux;
char *str;

aux.b_next = packB;

for ( j = 0; j < n; j += DGEMM_NR ) { // 2-th loop around micro-kernel
aux.n = min( n - j, DGEMM_NR );
for ( i = 0; i < m; i += DGEMM_MR ) { // 1-th loop around micro-kernel
aux.m = min( m - i, DGEMM_MR );
if ( i + DGEMM_MR >= m ) {
aux.b_next += DGEMM_NR * k;
}

( *bl_micro_kernel ) (
k,
&packA[ i * k ],
&packB[ j * k ],
&C[ j * ldc + i ],
(unsigned long long) ldc,
&aux
);
} // 1-th loop around micro-kernel
} // 2-th loop around micro-kernel
}

// C must be aligned
void bl_dgemm(
int m,
int n,
int k,
double *XA,
int lda,
double *XB,
int ldb,
double *C, // must be aligned
int ldc // ldc must also be aligned
)
{
int i, j, p, bl_ic_nt;
int ic, ib, jc, jb, pc, pb;
int ir, jr;
double *packA, *packB;
char *str;

// Early return if possible
if ( m == 0 || n == 0 || k == 0 ) {
printf( "bl_dgemm(): early return\n" );
return;
}

// sequential is the default situation
bl_ic_nt = 1;
// check the environment variable
str = getenv( "BLISLAB_IC_NT" );
if ( str != NULL ) {
bl_ic_nt = (int)strtol( str, NULL, 10 );
}

// Allocate packing buffers
packA = bl_malloc_aligned( DGEMM_KC, ( DGEMM_MC + 1 ) * bl_ic_nt, sizeof(double) );
packB = bl_malloc_aligned( DGEMM_KC, ( DGEMM_NC + 1 ) , sizeof(double) );

for ( jc = 0; jc < n; jc += DGEMM_NC ) { // 5-th loop around micro-kernel
jb = min( n - jc, DGEMM_NC );
for ( pc = 0; pc < k; pc += DGEMM_KC ) { // 4-th loop around micro-kernel
pb = min( k - pc, DGEMM_KC );

for ( j = 0; j < jb; j += DGEMM_NR ) {
packB_kcxnc_d(
min( jb - j, DGEMM_NR ),
pb,
&XB[ pc ],
k, // should be ldXB instead
jc + j,
&packB[ j * pb ]
);
}


for ( ic = 0; ic < m; ic += DGEMM_MC ) { // 3-rd loop around micro-kernel

ib = min( m - ic, DGEMM_MC );

for ( i = 0; i < ib; i += DGEMM_MR ) {
packA_mcxkc_d(
min( ib - i, DGEMM_MR ),
pb,
&XA[ pc * lda ],
m,
ic + i,
&packA[ 0 * DGEMM_MC * pb + i * pb ]
);
}

bl_macro_kernel(
ib,
jb,
pb,
packA + 0 * DGEMM_MC * pb,
packB,
&C[ jc * ldc + ic ],
ldc
);
} // End 3.rd loop around micro-kernel
} // End 4.th loop around micro-kernel
} // End 5.th loop around micro-kernel

free( packA );
free( packB );
}

BlisLab 的 tutorial 中用了一张比较吓人的图说明 GotoBLAS 中针对三级 cache 使用的分块算法(以及排流水的 Piping 过程):

pCUh3IH.png


不过,如果使用 GPU 的话,程序员可以通过软件操控 cache ,这里就可以使用很多数据搬运的 tricks,例如从 global memory 搬运到 shared memory,从 shared memory 搬运到寄存器等。只不过这种手动控制的编程复杂度很高。

一个常用的 GPU 编程优化点叫做 双缓冲优化 ,其一般流程是:

Global memory → shared memory → 计算 → 写回 memory

双缓冲的过程体现在将 shared memory 一分为二,一半用于做本次运算,一半用于做下一次传输,DMA 的数据传输与内核的运算过程重叠,以平滑多级存储结构的数据传输,让内核始终以峰值速度运行 —— 跟 prefetch 的作用基本上一样。

step3:并行化

基本上就是 基于 SIMD 的向量化编程 工作,需要讨论中间结果在寄存器中如何排布 / 如何使用合适的汇编指令计算等的问题。了解了对应体系结构的汇编指令细节之后就没什么问题。

例如,光两个 4 x 1 向量相乘得到结果矩阵的 “一层” 的问题,就有两种汇编指令解决方案:直接 Broadcast 模拟向量乘法,或者使用蝶式排列(Butterfly Permutation),如下所示。

pCUhUQP.png

pCUhYRI.png

P.S. 体系结构中有许多种互连函数,包括蝶式函数、均匀洗牌函数、反位序函数、PM2I 等。


  • 参考:
    • N. H. F. Beebe, et al. “High-Performance Matrix Multiplication,” Feb. 1970.
    • 张先轶, et al. “OpenBLAS:龙芯3A CPU 的高性能BLAS 库.”
    • BLISlab: A Sandbox for Optimizing GEMM, https://github.com/ybNo1/blislab/tree/master

Buy me a cup of coffee ☕.