{"id":618,"date":"2025-10-08T18:11:02","date_gmt":"2025-10-08T10:11:02","guid":{"rendered":"https:\/\/189505.xyz\/?p=618"},"modified":"2025-10-08T21:43:20","modified_gmt":"2025-10-08T13:43:20","slug":"gpu%e5%8e%8b%e6%b5%8b","status":"publish","type":"post","link":"https:\/\/189505.xyz\/?p=618","title":{"rendered":"GPU Stress Testing"},"content":{"rendered":"<h1><span class=\"ez-toc-section\" id=\"h100_cublasLt%E7%AE%97%E5%8A%9B%E5%8E%8B%E6%B5%8B\"><\/span>H100 cublasLt Compute Stress Test<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p>The H100 is rated at 1,979 TFLOPS of FP16 tensor core throughput with sparsity and 989.5 TFLOPS dense. In a cublasLt stress test, the best result reached 965.8 TFLOPS, about 97.6% of the dense rating. An excerpt of the results:<\/p>\n<pre><code>I1008 17:50:48.624795 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 32768 ,N: 512 ,K: 32768 ,ta: 1 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 957.370452\nI1008 17:50:14.535599 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 512 ,K: 32768 ,ta: 0 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 957.394501\nI1008 17:50:25.955717 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 1024 ,K: 32768 ,ta: 0 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 961.762644\nI1008 17:50:35.242605 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 512 ,K: 32768 ,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 962.997159\nI1008 17:50:27.392100 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 32768 ,N: 512 ,K: 32768 ,ta: 0 ,tb: 1 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 963.499441\nI1008 17:50:35.284884 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 1024 ,K: 32768 
,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 963.821024\nI1008 17:50:35.360344 2490133 gemm_benchmark_all_size.cu:229, do_one_bench] bench result M: 16384 ,N: 2048 ,K: 32768 ,ta: 1 ,tb: 0 ,abc_dtype: fp16 ,acc_dtype: fp32 ,tflops: 965.835337<\/code><\/pre>\n<p>Benchmark code:<br \/>\n<a href=\"https:\/\/github.com\/icyhearts\/cublas_learn\/blob\/master\/gemm_benchmark_all_size.cu\">https:\/\/github.com\/icyhearts\/cublas_learn\/blob\/master\/gemm_benchmark_all_size.cu<\/a><\/p>\n<h1><span class=\"ez-toc-section\" id=\"%E5%A6%82%E4%BD%95%E8%AE%A1%E7%AE%97h100_peak_performance\"><\/span>How to Calculate H100 Peak Performance<span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p><a href=\"https:\/\/forums.developer.nvidia.com\/t\/how-to-calculate-the-tensor-core-fp16-performance-of-h100\/244727\/2?utm_source=chatgpt.com\">https:\/\/forums.developer.nvidia.com\/t\/how-to-calculate-the-tensor-core-fp16-performance-of-h100\/244727\/2?utm_source=chatgpt.com<\/a><br \/>\nThe original post says:<\/p>\n<pre><code>The volta whitepaper indicates explicitly that each TC unit in Volta delivers 64 FMA ops per clock (equals 128 FLOPs\/clk). When looked at from an SM perspective, the SM as a whole (having 8 TC units) is capable of 1024 FLOPs\/clk. This seems to line up with stated numbers for V100 FP16 TC throughput which vary over a range of approximately 112 to 130 TFLOP\/s depending on sku\/variant. Let\u2019s convince ourselves of that. 
Considering the V100 PCIE with 80 SMs, this would be\n\n80 x 1024 = 81920 FLOPs\/clk\nDividing the stated 112TFLOP\/s performance of V100 PCIE by that number:\n\n112,000,000 MFLOP\/s \/ 81920 FLOP\/clk = 1367 Mclk\/s = 1367MHz\nWhich is a clock rate that is in line with the stated boost clock of V100.\n\nMoving on to Ampere A100, the whitepaper states that the A100 TC unit delivers 256 FMA ops\/clk, and considered at the SM level (four 3rd gen TC units\/SM) this translates to 1024 FMA ops\/clk, or 2048 FLOPs\/clk, a doubling of the TC throughput for FP16 (non-sparsity) when comparing a Volta SM to an Ampere SM, clock-for-clock. Likewise we can confirm the stated 312 TFLOP\/s number for A100 with 108 SMs in a similar fashion:\n\n108 x 2048 = 221,184 FLOP\/clk\nand\n\n312,000,000 MFLOP\/s \/ 221,184 FLOP\/clk = 1410M clk\/s = 1410MHz\nwhich is again in line with the stated\/published boost clock for the A100 GPU.\n\nMoving on to Hopper H100, the whitepaper simply states that the per SM throughput is again doubled compared to Ampere. So we are now at 4096 FLOP\/clk per SM.\n\nThe H100 PCIE has 114 SMs, so we get, per GPU:\n\n114 x 4096 = 466,944 FLOP\/clk\nThe stated perf is 756 TFLOP\/s, so\n\n756,000,000 MFLOP\/s \/ 466,944 FLOP\/clk = 1620M clk\/s = 1620MHz\nThe H100 PCIE board specification lists a max boost frequency of 1755MHz.\n\nBut, as pointed out below, table 3 in the H100 white paper indicates that max boost clock for TC usage on H100 PCIE is 1620MHz. 
So this calculation lines up with the stated boost frequency.<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>H100 cublasLt Compute Stress Test. The H100 is rated at 1,979 TFLOP &#8230; <a title=\"GPU Stress Testing\" class=\"read-more\" href=\"https:\/\/189505.xyz\/?p=618\" aria-label=\"More on GPU Stress Testing\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/189505.xyz\/index.php?rest_route=\/wp\/v2\/posts\/618"}],"collection":[{"href":"https:\/\/189505.xyz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/189505.xyz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/189505.xyz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/189505.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=618"}],"version-history":[{"count":5,"href":"https:\/\/189505.xyz\/index.php?rest_route=\/wp\/v2\/posts\/618\/revisions"}],"predecessor-version":[{"id":623,"href":"https:\/\/189505.xyz\/index.php?rest_route=\/wp\/v2\/posts\/618\/revisions\/623"}],"wp:attachment":[{"href":"https:\/\/189505.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=618"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/189505.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=618"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/189505.xyz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=618"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}