

### Profiling on HPC Systems

Presented by Nick Hagerty to CSE 6230 at GATech February 14, 2023 Contact: hagertynl@ornl.gov (HAGERTYNL@ORNL.gov)

ORNL is managed by UT-Battelle LLC for the US Department of Energy



### Who am I?



Me, June 2021

- Nick Hagerty, BS, MS in Computer Science, Miami (OH) University '21
- Interned at Air Force Research Laboratory in Dayton, OH in computational chemistry 2019-2021
- Joined Oak Ridge National Laboratory in June 2021



Mt LeConte, GSMNP, TN

COAK RIDGE LEADERSHIP

# What is profiling?

- Gathers information about the time and resources each routine within a program consumes
- Goals:
  - Identify resource-consuming routines to support improving the code base
  - Demonstrate program efficiency
- Available Methods (some of them):
  - Linux perf stat
  - AMD rocprof, omniperf
  - HPE perftools
  - NVIDIA nvprof, ncu
  - Others: HPCToolkit, Apex

Notation remark:

Flops, flops : floating-point operations

**FLOPS, FLOPs** : floating-point operations **per second** 

## Why profile?

- Moore's Law is at a transition
  - Impractical to build larger & larger machines – power, cooling, networking, space
- Application speed-up must rely less on hardware improvement, more on improving algorithms
  - Profiling & optimization



#### **Projected Performance Development**



## Where we're going with this

- CPU-based profiling: perf stat
- GPU-based profiling
  - Introduction: matrix addition
  - Basic Roofline model
  - Ramping up the flops: matrix multiplication
  - Improving matrix multiplication
  - Hierarchical Roofline model
- Demo GPU stencil



- Linux perf stat
  - Good for basic CPU-based profiling
- Test case: NxN square matrix addition, C = A+B
  - Row-major and column-major experiments
- Storing an array in memory:



• Column-major matrix addition:





• Row-major matrix addition:

- Memory reads are typically 64 bytes
  - Reading matrix[0] likely reads matrix[1], matrix[2], and matrix[3] as well

**CAK RIDGE** National Laboratory

8

• Linux – perf stat

• Good for basic CPU-based profiling

```
$ perf stat -d -d -d./matrix-add -m 8192 # These are for column-major
<application output>
  Performance counter stats for `./matrix-add -m 8192':
```

| 3,691.58 msec  | task-clock:u                         | # | 1.000   | CPUs utilized           |        |
|----------------|--------------------------------------|---|---------|-------------------------|--------|
| 0              | context-switches:u                   | # | 0.000   | /sec                    |        |
| 0              | cpu-migrations:u                     | # | 0.000   | /sec                    |        |
| 2,547          | page-faults:u                        | # | 689.949 | /sec                    |        |
| 11,573,436,803 | cycles:u                             | # | 3.135   | GHz                     | (83.32 |
| 211,027,746    | <pre>stalled-cycles-frontend:u</pre> | # | 1.82%   | frontend cycles idle    | (83.32 |
| 7,090,183      | <pre>stalled-cycles-backend:u</pre>  | # | 0.06%   | backend cycles idle     | (83.3  |
| 11,355,480,740 | instructions:u                       | # | 0.98    | insn per cycle          |        |
|                |                                      | # | 0.02    | stalled cycles per insn | (83.32 |
| 2,280,658,033  | branches:u                           | # | 617.801 | M/sec                   | (83.3  |
| 27,428         | branch-misses:u                      | # | 0.00%   | of all branches         | (83.3  |
|                |                                      |   |         |                         |        |

...many more metrics...

3.693409405 seconds time elapsed

3.576172000 seconds user 0.116005000 seconds sys



Profiling of row vs column-major matrix addition. 50% indicates identical values for both



## Checkpoint 1

- Any questions?
  - Matrix storage in memory
  - Column-major vs row-major addition
  - perf stat



### The architectures we'll use - GPU

- Profiling in these slides was done using one Graphics Compute Die (GCD) of an AMD MI250X
  - One AMD MI250X contains 2 GCDs, seen as logical GPUs by the runtime environment



MI250X Graphics Compute Die (GCD)



## Jumping off the deep end – GPU profiling

- GPU profilers analyze the exact instructions queued and resources consumed by a GPU kernel
- Starting at the bottom rocprof (AMD GPU)
  - Powerful GUI-less primitive profiler
- Each GPU vendor/architecture has a slightly different name for instructions

| Operation                      | AMD MI250X             |  |  |
|--------------------------------|------------------------|--|--|
| 64-bit floating point addition | SQ_INSTS_VALU_ADD_F64* |  |  |
| 32-bit floating point multiply | SQ_INSTS_VALU_MUL_F32* |  |  |
| 32-byte read from HBM          | TCC_EA_RDREQ_32B_sum** |  |  |

\* These instructions are per-wavefront, so they are multiplied by 64 \*\* There are multiple lanes that access HBM, so we sum across these lanes



- Matrix addition: C=A+B
  - Testing both column-major and row-major
- Storing a matrix in memory:





13

The kernel:

National Laboratory | FACILITY

```
// for an n x n square matrix
template<typename T>
 global void matrix add(const T* a, const T* b, T* c, int n,
                                                  int col major) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n) {
    if (!col major) {
                                                             blockDim.x
      int x = row * n + col;
                                                                      blockldx.x
      c[x] = a[x] + b[x];
    } else {
      // then switch row & column
                                                  blockDim.y
      int x = col * n + row;
      c[x] = a[x] + b[x];
                                                                threadIdx.{x,y}
                                                     blockldx.y
```

How the kernel is launched for an  $n \times n$  matrix:

int tpb = 16; // tpb^2 must be < max\_threads\_per\_block (compiler flag)
int block\_size = ceil((double) n / (double) tpb);</pre>



- Matrix addition: C=A+B
  - Testing both column-major and row-major
- Gather all metrics for HBM and L2 cache, Vector-ALU FP64 usage Example rocprof input file (line-wrapped for viewing):
- pmc : TCC\_EA\_RDREQ\_32B\_sum TCC\_EA\_RDREQ\_sum TCC\_EA\_WRREQ\_64B\_sum SQ\_INSTS\_VALU\_ADD\_F64 SQ\_INSTS\_VALU\_MUL\_F64 SQ\_INSTS\_VALU\_FMA\_F64 SQ\_INSTS\_VALU\_TRANS\_F64 pmc : TCC\_READ\_sum TCC\_WRITE\_sum TCP\_TCC\_READ\_REQ\_sum TCP\_TCC\_WRITE\_REQ\_sum gpu : 0 kernel: matrix\_add

Each `pmc :` line generates 1 application re-run\*.

\*When the user wants more metrics than the profiler can handle at once, the application is re-run.

**COAK RIDGE** LEADERSHIP National Laboratory

Launch command: \$ srun -N 1 -n 1 --gpus=1 rocprof -i rocprof.input.txt \ --timestamp on -o profile.madd.csv ./matrix\_add\_gpu

Content of profile.madd.csv:

Index,KernelName,...,TCC\_EA\_RDREQ\_32B\_sum,TCC\_EA\_RDREQ\_sum,...
2,"void matrix\_add<double>(double const\*,...)",...,0,67114780,...
<one row for each kernel>

- What can we do with these results?
  - With simple kernels, validating that profiling matches expectation
    - If you use a 1GB matrix, make sure your bytes read is about 2x 1GB
    - Calculate floating-point performance (Flops per second)
  - Check caching, register pressure, shared memory usage
  - For complex kernels, roofline profiling is a good model of performance

#### **Validation**

#### Bytes Read from HBM:

ideal read: 2x 4096x4096 matrices of doubles = 268.4 MB // for A & B

#### Flops:

flops\_fp64 = 64 \* (SQ\_INSTS\_VALU\_ADD\_F64 + SQ\_INSTS\_VALU\_MUL\_F64 + SQ\_INSTS\_VALU\_TRANS\_F64 + 2\*SQ\_INSTS\_VALU\_FMA\_F64) time = (CompleteNs - BeginNs) / power(10, 9) flops\_per\_s = flops\_fp64 / time // avg\_over\_last\_3\_kernel\_invocations flops\_per\_s\_rowmajor = 20.93 GFLOPs flops\_per\_s\_colmajor = 16.16 GFLOPs

**COAK RIDGE** National Laboratory

- What conclusions can we draw from this?
  - Row-major matrix addition performed almost 30% better than columnmajor, but the bytes read from HBM were very similar – how can we explain this performance difference?

#### Action items:

- 1. We should look at the LDS, L1 & L2 cache activity now, since HBM usage doesn't show anything significant
- 2. We have no idea if 21 GFLOPs is any good on the current hardware a. This is often the case when profiling complex kernels



#### **Checking L2 cache operations**

#### Bytes to/from L2 cache controller:

ideal read: 2x 4096x4096 doubles = 268.4 MB // for A & B ideal write: 1x 4096x4096 doubles = 134.2 MB // for C

```
// compute number of bytes read/written
bytes_read* = 64 * TCP_TCC_READ_REQ_sum
bytes write = 64 * TCP_TCC_WRITE_REQ_sum
```

```
// avg over last 3 kernel invocations:
bytes_read_rowmajor = 353.1 MB
bytes_read_colmajor = 353.7 MB
```

bytes\_write\_rowmajor = 176.6 MB
bytes\_write\_colmajor = 282.5 MB

\*AMD MI250X can send 2 64-byte reads as one 128-byte read

row-major sends 30% of it's reads in 128-byte chunks, vs 0.01% for column-major



**COAK RIDGE** National Laboratory

## Introduction to Roofline Modeling

- A Roofline model[1] plots floating-point performance as a function of arithmetic (or operational) intensity
  - Your performance is dependent on required bytes from memory/cache
- Example: Crusher an AMD MI250Xpowered test & development system at ORNL[1]

COAK RIDGE LEADERSHIP

21



[1] Williams, S., et al. Communications of the ACM, volume 52, pages 65–76, April 2009.[2] https://docs.olcf.ornl.gov/systems/crusher\_quick\_start\_guide.html

## Introduction to Roofline Modeling – matrix addition

#### Flops:

flops\_fp64 = 64 \* (SQ\_INSTS\_VALU\_ADD\_F64 + SQ\_INSTS\_VALU\_MUL\_F64 +
 SQ\_INSTS\_VALU\_TRANS\_F64 + 2\*SQ\_INSTS\_VALU\_FMA\_F64)
time = (CompleteNs - BeginNs) / power(10, 9)

ArithmeticIntensity = flops\_fp64 / bytes\_total
flops\_per\_s = flops\_fp64 / time

| Alignment | AI (Flops/Byte) |       | Theoretical Peak<br>(GFLOPs) |
|-----------|-----------------|-------|------------------------------|
| Column    | 0.0408          | 16.16 | 65.28                        |
| Row       | 0.0419          | 20.93 | 67.04                        |



## Introduction to Roofline Modeling – matrix addition



| Alignment | Al<br>(Flops/Byte) | Performance<br>(GFLOPs) | Theoretical<br>Peak<br>(GFLOPs) |
|-----------|--------------------|-------------------------|---------------------------------|
| Column    | 0.0408             | 16.16                   | 65.28                           |
| Row       | 0.0419             | 20.93                   | 67.04                           |

- Achieved about 30% of peak at the given AI for row-major addition
- Theoretical peak determined by formula below:

National Laboratory | FACILITY

## Checkpoint 2

- Any questions?
  - Matrix addition on the GPU
  - rocprof introduction
  - Roofline model introduction



- Matrix addition: C = AxB
  - If A is  $n \ge k$  and B is  $k \ge m$ , then:
    - C is *n* x *m*, each position in C requires the sum of *k* multiplications (the dot product of each column of B, row of A)





26

National Laboratory | FACILITY

| N (N x N<br>matrix) | Al<br>(Flops/Byte) | Performance<br>(GFLOPs) | Roofline<br>Peak<br>(GFLOPs) |
|---------------------|--------------------|-------------------------|------------------------------|
| 1024                | 21.06              | 644.63                  | 23900                        |
| 2048                | 4.06               | 631.72                  | 6496                         |
| 4096                | 3.78               | 618.59                  | 6048                         |

- For a matrix multiply, this doesn't look very good – even at larger sizes, achieving <3% of device capability</li>
- Vendors will likely provide libraries for things they want to be highly optimized. Let's try one - rocBLAS



| N (N x N<br>matrix) | Al<br>(Flops/Byte) | Performance<br>(GFLOPs) | Theoretical<br>Peak<br>(GFLOPs) |
|---------------------|--------------------|-------------------------|---------------------------------|
| 1024                | 42.00              | 8822.32                 | 23900                           |
| 2048                | 47.44              | 11017.20                | 23900                           |
| 4096                | 35.77              | 12337.90                | 23900                           |

- This looks much better
- >10x floating-point performance
- >50% device peak at largest size

27

## Types of Flops – matrix multiplication

• The AMD MI250X provides *Matrix cores*, which are highly optimized to do matrix operations, instead of the standard vector-ALU

| Method  | ADD_F64* | MUL_F64 | FMA_F64  | TRANS_F64 | MFMA_MOPS_F64 | Total GFlops |
|---------|----------|---------|----------|-----------|---------------|--------------|
| rocBLAS | 0        | 0       | 0        | 0         | 4194304       | 1.07**       |
| naive   | 0        | 16384   | 16793600 | 0         | 0             | 2.15         |

\*all metric names above are prefixed with `SQ\_INSTS\_VALU\_` in `rocprof` \*\*F64 matrix operations are multiplied by 256 to compute the number of performed Flops

At its core, the naïve algorithm has the following line of code to compute a dot-product:

Acility

28

This is considered a fused multiply-add operation (double the Flops!!)

# Types of Flops – matrix multiplication

 The AMD MI250X provides Matrix cores, which are highly optimized to do matrix operations, instead of the standard vector-ALU

| Method  | ADD_F64* | MUL_F64 | FMA_F64  | 12ANS_F64                                                                                             | MFMA_MOPS_F64 | Total GFlops | S |  |
|---------|----------|---------|----------|-------------------------------------------------------------------------------------------------------|---------------|--------------|---|--|
| rocBLAS | 0        | 0       | 0        | Matrix cores have a higher peak<br>than vector-ALU can achieve –<br>roofline plot needs to be updated |               |              |   |  |
| naive   | 0        | 16384   | 16793600 |                                                                                                       |               |              |   |  |

\*all metric names above are prefixed with `SQ\_INSTS\_VALU\_` in `rocprof` \*\*F64 matrix operations are multiplied by 256 to compute the number of performed Flops

At its core, the naïve algorithm has the following line of code to compute a dot-product:

Acility

29

This is considered a fused multiply-add operation (double the Flops!!)



**CAK RIDGE** National Laboratory

30

## Quick note - Hierarchical Roofline models

- A Hierarchical Roofline model[1] does the same thing as a traditional roofline, but taking into account device caches
- In the rest of this talk, we're going to add the L2 cache in

31

National Laboratory

 All HBM traffic will pass through the L2 cache, so it is accounted for in that total



AK RIDGE LEADERSHIP [1] C. Yang, etc al. Concurrency Computational Pract and Exper 2020. doi: 10.1002/cpe.5547

## Checkpoint 3

- Any questions?
  - Matrix multiplication on the GPU
  - RocBLAS Matrix multiplication
  - AMD Matrix cores
  - Hierarchical Roofline model



## Profiling & Optimization Demo – stencil - image blurring

• 2-D spatial averaging





### Demo – stencil - image blurring

• CPU implementation:

National Laboratory

```
// iqnore edges
for (size t i = RADIUS + 1; i < n - RADIUS; i++) {
  for (size t j = RADIUS + 1; j < n - RADIUS; j++) {
    T host sum = 0;
    for (int i offset = -RADIUS; i offset <= RADIUS; i offset++) {</pre>
      for (int j offset = -RADIUS; j offset <= RADIUS; j offset++) {</pre>
        host sum += a host[(i + i offset) * n + (j + j offset)];
    T host avg = host sum / ((RADIUS * 2 + 1) * (RADIUS * 2 + 1));
    // use host avg to check correctness of GPU-generated answer
```

\*keywords like restrict and const are removed for viewing

### Demo – stencil - image blurring

• Naïve GPU implementation:

35

National Laboratory

```
template<typename T>
 global void image_blur(T* a, T* b, int n, int m) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row > RADIUS && col > RADIUS && row < m - RADIUS && col < n - RADIUS) {
    for (int i offset = -RADIUS; i offset <= RADIUS; i offset++) {</pre>
      for (int j offset = -RADIUS; j offset <= RADIUS; j offset++) {</pre>
        b[row * n + col] += a[(row + i offset) * n + (col + j offset)];
    b[row * n + col] /= ((RADIUS * 2 + 1) * (RADIUS * 2 + 1));
```

\*keywords like restrict and const are removed for viewing

### Demo – stencil - image blurring

• Naïve GPU implementation:

36

National Laboratory | FACILITY

```
template<typename T>
__global__ void image_blur(T* a, T* b, int n, int m) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    For simplicity, disregard edges
    if (row > RADIUS && col > RADIUS && row < m - RADIUS && col < n - RADIUS)
    for (int i_offset = -RADIUS; i_offset <= RADIUS; i_offset++) {
        for (int j_offset = -RADIUS; j_offset <= RADIUS; j_offset++) {
            b[row * n + col] += a[(row + i_offset) * n + (col + j_offset)];
        Sum of area
        b[row * n + col] /= ((RADIUS * 2 + 1) * (RADIUS * 2 + 1));
    }
}
</pre>
```

\*keywords like \_\_\_\_\_restrict\_\_\_ and const are removed for viewing

• Results:

| Metric           | Value(Naïve) |
|------------------|--------------|
| Bytes_L2         | 9.13 GB      |
| L2 cache hit (%) | 29.4%        |
| AI_L2            | 0.68         |
| FLOPs            | 742 GF/s     |





## Profiling & Optimization Demo – stencil - image blurring

 Let's expand the picture... Notice how in each block, each space is read multiple times?



38

National Laboratory

## Profiling & Optimization Demo – stencil - image blurring

• Architecture diagram of a single Compute Unit:





39

## Profiling & Optimization Demo – stencil - image blurring

• How we store this in LDS

Load the entries for each block, plus their surrounding neighbors, into matrix in LDS





40

National Laboratory



\*keywords like restrict and const are removed for viewing

- Result:
  - Increased Arithmetic Intensity
    - 23% fewer bytes requested from L2 cache
  - Increased Flop rate

#### Exact same number of Flops

| Metric           | Naïve    | LDS-enabled |
|------------------|----------|-------------|
| Bytes_L2         | 9.13 GB  | 6.98 GB     |
| L2 cache hit (%) | 29.4%    | 38.4%       |
| AI_L2            | 0.68     | 0.89        |
| FLOPs            | 742 GF/s | 955 GF/s    |
| Bytes_LDS        | 0 B      | 5120 B      |



**CAK RIDGE** 

National Laboratory FACILITY

- Result:
  - Increased Arithmetic Intensity
    - 23% fewer bytes requested from L2 cache
  - Increased Flop rate
  - Exact same number of Flops

| Metric           | Naïve    | LDS-enabled    |
|------------------|----------|----------------|
| Bytes_L2         | 9.13 GB  | 6.98 GB        |
| L2 cache hit (%) | 29.4%    | 38.4%          |
| AI_L2            | 0.68     | 0.89           |
| FLOPs            | 742 GF/s | 955 GF/s       |
| Bytes_LDS        | 0 B      | 5120 B         |
|                  |          | Purconding FKP |



By sending 5KB to LDS, we saved 2GB of reads to L2

**CAK RIDGE** 

National Laboratory FACILITY

- Result:
  - Increased Arithmetic Intensity
    - 23% fewer bytes requested from L2 cache
  - Increased Flop rate

#### Exact same number of Flops

| Metric           | Naïve    | LDS-enabled |
|------------------|----------|-------------|
| Bytes_L2         | 9.13 GB  | 6.98 GB     |
| L2 cache hit (%) | 29.4%    | 38.4%       |
| AI_L2            | 0.68     | 0.89        |
| FLOPs            | 742 GF/s | 955 GF/s    |
| Bytes_LDS        | ОВ       | 5120 B      |



What happens if we increase the radius?

**CAK RIDGE** 

National Laboratory FACILITY

LEADERSHIP COMPUTING

- Result:
  - Higher AI for both naïve and LDSenabled
  - No performance gain for naïve kernel, drastically improved performance (+57%) for LDS-enabled kernel

| Metric              | LDS R=1      | LDS R=2      |
|---------------------|--------------|--------------|
| Bytes_L2            | 6.98 GB      | 7.52 GB      |
| L2 cache hit<br>(%) | 38.4%        | 42.6%        |
| AI_L2               | 0.89 Flops/B | 1.39 Flops/B |
| FLOPs               | 955 GF/s     | 1498 GF/s    |
| Bytes_LDS           | 5120 B       | 5632 B       |



**CAK RIDGE** 

National Laboratory FACILITY

LEADERSHIP COMPUTING

### Outcomes

- Some helpful tips to keep in mind:
  - Know your hardware!
    - Utilize LDS, matrix cores, whatever your architecture can provide you. Matrix cores may be specific to AMD, but LDS is common
  - Mind your memory
    - Minimizing off-chip (HBM) reads can help improve performance, but doesn't always tell the full picture. Feel free to check L2 cache as well. Try to use algorithms that take advantage of the storage of the data structure
  - Reduce the Flops
    - When possible, use algebra to simplify the math required (we didn't cover an example of this today)



### Internships & Jobs

- Pathways to Computing Internship:
  - <u>https://education.ornl.gov/pathways/</u>
- Find open job postings:
  - <u>https://jobs.ornl.gov</u>





## Frontier fun facts



## 1.1 Exa-FLOPS (floating-point operations)

# 1 quintillion (10<sup>18</sup>) calculations per second

### **<u>4 years</u>**, if everyone on the earth did 1 calculation per second

COAK RIDGE LEADERSHIP National Laboratory

AMD),

which

takes

Each cabinet weights 8,000 lbs – weight of 2 F150's (74 cabinets total)

\$0.03 out of every \$1 spent to power the machine, to cool it

6000 gallons of water moved per minute. Fills an Olympic-size swimming pool in 30 minutes

700 PB storage holds about 35x more data than the Library of Congress

COAK RIDGE LEADERSHIP