H100 cuBLASLt compute throughput stress test
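The H100's rated FP16 Tensor Core throughput is 1,979 TFLOP/s with sparsity and 989.5 TFLOP/s dense. In a cuBLASLt stress test, the highest throughput reached was 965.8 TFLOP/s. Excerpt from the benchmark output: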
I1008 17:50:48.624795 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 32768 ,N: 512 ,K: 32768 ,ta: 1 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 957.370452
I1008 17:50:14.535599 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 512 ,K: 32768 ,ta: 0 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 957.394501
I1008 17:50:25.955717 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 1024 ,K: 32768 ,ta: 0 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 961.762644
I1008 17:50:35.242605 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 512 ,K: 32768 ,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 962.997159
I1008 17:50:27.392100 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 32768 ,N: 512 ,K: 32768 ,ta: 0 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 963.499441
I1008 17:50:35.284884 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 1024 ,K: 32768 ,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 963.821024
I1008 17:50:35.360344 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 2048 ,K: 32768 ,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 965.835337
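To put the excerpt in perspective, here is a minimal Python sketch that pulls the shape and throughput out of such log lines and compares the best result with the rated dense figure of 989.5 TFLOP/s quoted above. The regular expression, the helper names, and the shortened log lines are illustrative only and are not part of the benchmark. The runtime estimate assumes the benchmark counts 2*M*N*K FLOPs per GEMM, which is the usual convention but is not confirmed by the log itself.

```python
import re

# Two lines copied from the excerpt above, with the glog prefix dropped.
LOG = """\
bench result M: 32768 ,N: 512 ,K: 32768 ,ta: 1 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 957.370452
bench result M: 16384 ,N: 2048 ,K: 32768 ,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 965.835337
"""

DENSE_PEAK_TFLOPS = 989.5  # rated dense FP16 figure quoted in the text

# Hypothetical pattern matching the "bench result" format shown in the log.
PATTERN = re.compile(r"M: (\d+) ,N: (\d+) ,K: (\d+) .*?tflops: ([\d.]+)")


def parse_results(text):
    """Return a list of (M, N, K, tflops) tuples found in the log text."""
    return [
        (int(m[1]), int(m[2]), int(m[3]), float(m[4]))
        for m in PATTERN.finditer(text)
    ]


def gemm_time_ms(m, n, k, tflops):
    """Runtime implied by a TFLOP/s figure, assuming 2*M*N*K FLOPs per GEMM."""
    return 2 * m * n * k / (tflops * 1e12) * 1e3


best = max(parse_results(LOG), key=lambda row: row[3])
efficiency = best[3] / DENSE_PEAK_TFLOPS

print(f"best shape M={best[0]} N={best[1]} K={best[2]}: {best[3]:.1f} TFLOP/s")
print(f"fraction of rated dense peak: {efficiency:.1%}")
print(f"implied time per GEMM: {gemm_time_ms(*best):.3f} ms")
```

With the numbers from the excerpt, the best shape comes out at a little under 98% of the rated dense peak.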
benchmark code:
https://github.com/icyhearts/cublas_learn/blob/master/gemm_benchmark_all_size.cu
How to calculate H100 peak performance
The Volta whitepaper states explicitly that each Tensor Core (TC) unit in Volta delivers 64 FMA ops per clock (equal to 128 FLOP/clk). Viewed from the SM perspective, an SM with 8 TC units is capable of 1024 FLOP/clk. This lines up with the stated V100 FP16 TC throughput, which ranges from approximately 112 to 130 TFLOP/s depending on SKU/variant. Let’s convince ourselves of that. For the V100 PCIe with 80 SMs, this gives
80 x 1024 = 81,920 FLOP/clk
Dividing the stated 112 TFLOP/s performance of the V100 PCIe by that number:
112,000,000 MFLOP/s / 81,920 FLOP/clk = 1367 Mclk/s = 1367 MHz
Which is a clock rate that is in line with the stated boost clock of V100.
Moving on to Ampere A100, the whitepaper states that the A100 TC unit delivers 256 FMA ops/clk, and considered at the SM level (four 3rd gen TC units/SM) this translates to 1024 FMA ops/clk, or 2048 FLOPs/clk, a doubling of the TC throughput for FP16 (non-sparsity) when comparing a Volta SM to an Ampere SM, clock-for-clock. Likewise we can confirm the stated 312 TFLOP/s number for A100 with 108 SMs in a similar fashion:
108 x 2048 = 221,184 FLOP/clk
and
312,000,000 MFLOP/s / 221,184 FLOP/clk = 1410 Mclk/s = 1410 MHz
which is again in line with the stated/published boost clock for the A100 GPU.
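The per-SM figures used so far follow directly from the TC unit counts and the FMA rates quoted from the whitepapers. A small Python sketch of that step, with illustrative function names, counting one FMA as two FLOPs:

```python
FLOPS_PER_FMA = 2  # one fused multiply-add counts as a multiply plus an add


def sm_flops_per_clk(tc_units_per_sm, fma_per_tc_per_clk):
    """Dense FP16 Tensor Core FLOPs that one SM can deliver per clock."""
    return tc_units_per_sm * fma_per_tc_per_clk * FLOPS_PER_FMA


def gpu_flops_per_clk(sm_count, tc_units_per_sm, fma_per_tc_per_clk):
    """Dense FP16 Tensor Core FLOPs per clock across all SMs of a GPU."""
    return sm_count * sm_flops_per_clk(tc_units_per_sm, fma_per_tc_per_clk)


# Unit counts and FMA rates as quoted in the text above.
volta_sm = sm_flops_per_clk(8, 64)     # Volta: 8 TC units, 64 FMA/clk each
ampere_sm = sm_flops_per_clk(4, 256)   # Ampere: 4 TC units, 256 FMA/clk each

v100_gpu = gpu_flops_per_clk(80, 8, 64)     # V100 PCIe, 80 SMs
a100_gpu = gpu_flops_per_clk(108, 4, 256)   # A100, 108 SMs

print(f"Volta SM:  {volta_sm} FLOP/clk, V100 PCIe: {v100_gpu:,} FLOP/clk")
print(f"Ampere SM: {ampere_sm} FLOP/clk, A100:      {a100_gpu:,} FLOP/clk")
```

Writing it this way makes the Volta-to-Ampere doubling visible: half as many TC units per SM, each four times as wide.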
Moving on to Hopper H100, the whitepaper simply states that the per SM throughput is again doubled compared to Ampere. So we are now at 4096 FLOP/clk per SM.
The H100 PCIe has 114 SMs, so we get, per GPU:
114 x 4096 = 466,944 FLOP/clk
The stated performance is 756 TFLOP/s, so
756,000,000 MFLOP/s / 466,944 FLOP/clk ≈ 1619 Mclk/s, or roughly 1620 MHz
The H100 PCIe board specification lists a max boost frequency of 1755 MHz.
However, Table 3 in the H100 whitepaper indicates that the max boost clock for TC usage on the H100 PCIe is 1620 MHz, so this calculation lines up with the stated boost frequency.
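The three checks above share one formula: rated throughput divided by (SM count x FLOP/clk per SM) gives the clock the rating implies, and the same formula run in reverse gives a peak figure from a clock. A short Python sketch using only the numbers quoted in the text (the function names are illustrative):

```python
# (name, SM count, FLOP/clk per SM, rated dense FP16 TFLOP/s), as quoted in the text
GPUS = [
    ("V100 PCIe", 80, 1024, 112.0),
    ("A100", 108, 2048, 312.0),
    ("H100 PCIe", 114, 4096, 756.0),
]


def implied_clock_mhz(sm_count, flops_per_clk_per_sm, rated_tflops):
    """Clock in MHz at which the GPU would deliver the rated TFLOP/s."""
    gpu_flops_per_clk = sm_count * flops_per_clk_per_sm
    # TFLOP/s -> MFLOP/s, then divide by FLOP/clk to get Mclk/s (MHz).
    return rated_tflops * 1e6 / gpu_flops_per_clk


def peak_tflops(sm_count, flops_per_clk_per_sm, clock_mhz):
    """Peak dense TFLOP/s at a given clock: the same formula, inverted."""
    return sm_count * flops_per_clk_per_sm * clock_mhz / 1e6


for name, sms, per_sm, tflops in GPUS:
    mhz = implied_clock_mhz(sms, per_sm, tflops)
    print(f"{name}: {tflops:.0f} TFLOP/s implies {mhz:.0f} MHz")

# Forward direction for the H100 PCIe at the 1620 MHz TC boost clock.
print(f"H100 PCIe at 1620 MHz: {peak_tflops(114, 4096, 1620):.1f} TFLOP/s")
```

The implied clocks match the published boost clocks to within rounding of the rated TFLOP/s figures.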