Flash Attention Implementation

GPU Info

▼ code ▼ output ▶ uv-logs | Cell: nv | 0.21s | Raw GitHub
import subprocess

# Capture nvidia-smi to record the driver, CUDA version, and GPU state for this run.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Thu Oct 30 15:52:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   30C    P0             75W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
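
The same facts can also be read programmatically once torch is installed. A minimal sketch (an assumption, not part of the benchmark cell below; it simply queries CUDA device 0):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 2**30:.1f} GiB",
          f"{props.multi_processor_count} SMs")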

Flash Attention Benchmark

▼ code ▼ output ▶ uv-logs | Cell: benchmark | 7.50s | Raw GitHub
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_flash(q, k, v):
    # SDPA expects (batch, heads, seq_len, head_dim); the harness supplies
    # (batch, seq_len, heads, head_dim), so transpose and force contiguous layouts.
    qt, kt, vt = (x.transpose(1, 2).contiguous() for x in (q, k, v))
    # Pin the backend to FlashAttention instead of letting SDPA choose one.
    with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.FLASH_ATTENTION):
        o = torch.nn.functional.scaled_dot_product_attention(qt, kt, vt)
    # Transpose back to the harness layout.
    return o.transpose(1, 2).contiguous()


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="torch_flash_ma",
    impl_tags={"family": "torch-sdpa", "backend": "FLASH", "compile": "max-autotune"},
    impl_func=torch_flash,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.587ms       102.23%       3.587ms       3.587ms             1  
                                         torch_flash_ma         7.11%     370.236us        47.42%       2.468ms       2.468ms       0.000us         0.00%       3.549ms       3.549ms             1  
                     aten::scaled_dot_product_attention         0.85%      44.391us         4.44%     231.334us      77.111us       0.000us         0.00%       2.791ms     930.498us             3  
              aten::_scaled_dot_product_flash_attention         0.51%      26.381us         3.59%     186.943us      62.314us       0.000us         0.00%       2.791ms     930.498us             3  
                         aten::_flash_attention_forward         0.76%      39.658us         2.57%     134.002us      44.667us       2.791ms        79.55%       2.791ms     930.498us             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       2.791ms        79.55%       2.791ms     930.498us             3  
                                       aten::contiguous         0.30%      15.641us        34.37%       1.789ms     149.098us       0.000us         0.00%     757.697us      63.141us            12  
                                            aten::clone         0.74%      38.596us        34.07%       1.774ms     147.794us       0.000us         0.00%     757.697us      63.141us            12  
                                            aten::copy_         1.78%      92.553us        31.63%       1.647ms     137.218us     717.505us        20.45%     757.697us      63.141us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     717.505us        20.45%     717.505us      59.792us            12  
                                Activity Buffer Request        27.90%       1.452ms        27.90%       1.452ms       1.452ms      40.192us         1.15%      40.192us      40.192us             1  
                                        aten::transpose         1.49%      77.390us         2.00%     104.302us       4.346us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.52%      26.912us         0.52%      26.912us       1.121us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.55%      28.453us         2.13%     110.953us       7.397us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.93%     100.211us         1.93%     100.211us       4.175us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         2.45%     127.363us         2.45%     127.363us       8.491us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.32%      16.580us         0.32%      16.580us       5.527us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.05%       2.441us         0.05%       2.441us       0.407us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.18%       9.241us         0.18%       9.241us       3.080us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        52.58%       2.737ms        52.58%       2.737ms       2.737ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.205ms
Self CUDA time total: 3.509ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.72%     248.136us        41.78%       2.196ms       2.196ms       0.000us         0.00%       3.803ms       3.803ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.759ms       100.28%       3.759ms       3.759ms             1  
                     aten::scaled_dot_product_attention         0.51%      26.852us         3.40%     178.734us      59.578us       0.000us         0.00%       2.990ms     996.607us             3  
              aten::_scaled_dot_product_flash_attention         0.35%      18.418us         2.89%     151.882us      50.627us       0.000us         0.00%       2.990ms     996.607us             3  
                         aten::_flash_attention_forward         0.65%      34.063us         2.10%     110.562us      36.854us       2.990ms        79.76%       2.990ms     996.607us             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       2.990ms        79.76%       2.990ms     996.607us             3  
                                       aten::contiguous         0.19%      10.079us        32.75%       1.721ms     143.446us       0.000us         0.00%     813.629us      67.802us            12  
                                            aten::clone         0.54%      28.151us        32.56%       1.711ms     142.606us       0.000us         0.00%     813.629us      67.802us            12  
                                            aten::copy_         1.97%     103.281us        30.84%       1.621ms     135.084us     758.782us        20.24%     813.629us      67.802us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     758.782us        20.24%     758.782us      63.232us            12  
                                Activity Buffer Request        27.29%       1.434ms        27.29%       1.434ms       1.434ms      54.847us         1.46%      54.847us      54.847us             1  
                                        aten::transpose         0.98%      51.741us         1.34%      70.423us       2.934us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.36%      18.682us         0.36%      18.682us       0.778us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.38%      19.848us         1.54%      80.939us       5.396us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.45%      76.001us         1.45%      76.001us       3.167us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         2.04%     106.952us         2.04%     106.952us       7.130us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.26%      13.850us         0.26%      13.850us       4.617us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.04%       1.860us         0.04%       1.860us       0.310us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       3.760us         0.07%       3.760us       1.253us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.22%       3.060ms        58.22%       3.060ms       3.060ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.255ms
Self CUDA time total: 3.749ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.59%     242.054us        41.69%       2.201ms       2.201ms       0.000us         0.00%       3.795ms       3.795ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.746ms       100.27%       3.746ms       3.746ms             1  
                     aten::scaled_dot_product_attention         0.50%      26.150us         3.40%     179.413us      59.804us       0.000us         0.00%       2.957ms     985.581us             3  
              aten::_scaled_dot_product_flash_attention         0.35%      18.371us         2.90%     153.263us      51.088us       0.000us         0.00%       2.957ms     985.581us             3  
                         aten::_flash_attention_forward         0.64%      34.041us         2.11%     111.213us      37.071us       2.957ms        79.14%       2.957ms     985.581us             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       2.957ms        79.14%       2.957ms     985.581us             3  
                                       aten::contiguous         0.19%       9.991us        32.85%       1.734ms     144.489us       0.000us         0.00%     838.147us      69.846us            12  
                                            aten::clone         0.52%      27.541us        32.66%       1.724ms     143.657us       0.000us         0.00%     838.147us      69.846us            12  
                                            aten::copy_         1.47%      77.641us        30.91%       1.632ms     135.987us     779.363us        20.86%     838.147us      69.846us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     779.363us        20.86%     779.363us      64.947us            12  
                                Activity Buffer Request        27.89%       1.472ms        27.89%       1.472ms       1.472ms      58.784us         1.57%      58.784us      58.784us             1  
                                        aten::transpose         0.96%      50.819us         1.31%      69.110us       2.880us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.35%      18.291us         0.35%      18.291us       0.762us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.38%      20.141us         1.58%      83.392us       5.559us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.49%      78.782us         1.49%      78.782us       3.283us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         1.99%     104.800us         1.99%     104.800us       6.987us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.27%      14.320us         0.27%      14.320us       4.773us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.04%       1.870us         0.04%       1.870us       0.312us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       3.720us         0.07%       3.720us       1.240us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.31%       3.078ms        58.31%       3.078ms       3.078ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.279ms
Self CUDA time total: 3.736ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.47%     246.252us        42.66%       2.352ms       2.352ms       0.000us         0.00%       3.878ms       3.878ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       3.831ms       100.28%       3.831ms       3.831ms             1  
                     aten::scaled_dot_product_attention         0.47%      26.180us         3.22%     177.714us      59.238us       0.000us         0.00%       3.035ms       1.012ms             3  
              aten::_scaled_dot_product_flash_attention         0.34%      18.934us         2.75%     151.534us      50.511us       0.000us         0.00%       3.035ms       1.012ms             3  
                         aten::_flash_attention_forward         0.60%      33.169us         1.99%     109.931us      36.644us       3.035ms        79.45%       3.035ms       1.012ms             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       3.035ms        79.45%       3.035ms       1.012ms             3  
                                       aten::contiguous         0.19%      10.269us        34.14%       1.882ms     156.829us       0.000us         0.00%     843.264us      70.272us            12  
                                            aten::clone         0.51%      27.861us        33.95%       1.872ms     155.974us       0.000us         0.00%     843.264us      70.272us            12  
                                            aten::copy_         1.39%      76.612us        32.27%       1.779ms     148.225us     785.216us        20.55%     843.264us      70.272us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     785.216us        20.55%     785.216us      65.435us            12  
                                Activity Buffer Request        26.00%       1.433ms        26.00%       1.433ms       1.433ms      58.048us         1.52%      58.048us      58.048us             1  
                                        aten::transpose         0.90%      49.620us         1.24%      68.282us       2.845us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.34%      18.662us         0.34%      18.662us       0.778us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.37%      20.139us         1.52%      83.911us       5.594us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.44%      79.524us         1.44%      79.524us       3.313us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         5.29%     291.664us         5.29%     291.664us      19.444us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.25%      13.850us         0.25%      13.850us       4.617us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.810us         0.03%       1.810us       0.302us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       3.620us         0.07%       3.620us       1.207us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        57.34%       3.161ms        57.34%       3.161ms       3.161ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.512ms
Self CUDA time total: 3.820ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.69%     283.303us        42.14%       2.547ms       2.547ms       0.000us         0.00%       4.304ms       4.304ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       4.254ms       100.24%       4.254ms       4.254ms             1  
                     aten::scaled_dot_product_attention         0.82%      49.722us         3.53%     213.285us      71.095us       0.000us         0.00%       3.439ms       1.146ms             3  
              aten::_scaled_dot_product_flash_attention         0.34%      20.582us         2.71%     163.563us      54.521us       0.000us         0.00%       3.439ms       1.146ms             3  
                         aten::_flash_attention_forward         0.62%      37.231us         1.93%     116.771us      38.924us       3.439ms        81.02%       3.439ms       1.146ms             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       3.439ms        81.02%       3.439ms       1.146ms             3  
                                       aten::contiguous         0.18%      10.912us        32.97%       1.993ms     166.068us       0.000us         0.00%     865.695us      72.141us            12  
                                            aten::clone         0.50%      30.059us        32.79%       1.982ms     165.158us       0.000us         0.00%     865.695us      72.141us            12  
                                            aten::copy_         1.39%      83.902us        31.17%       1.884ms     157.000us     805.439us        18.98%     865.695us      72.141us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     805.439us        18.98%     805.439us      67.120us            12  
                                Activity Buffer Request        24.08%       1.456ms        24.08%       1.456ms       1.456ms      60.256us         1.42%      60.256us      60.256us             1  
                                        aten::transpose         1.06%      63.793us         1.39%      84.162us       3.507us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.34%      20.369us         0.34%      20.369us       0.849us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.36%      21.791us         1.46%      88.331us       5.889us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.33%      80.570us         1.33%      80.570us       3.357us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         6.09%     368.355us         6.09%     368.355us      24.557us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.25%      15.000us         0.25%      15.000us       5.000us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.990us         0.03%       1.990us       0.332us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       4.160us         0.07%       4.160us       1.387us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        57.86%       3.497ms        57.86%       3.497ms       3.497ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.045ms
Self CUDA time total: 4.244ms



======================================================================
PROFILE TRACE: torch_flash_ma | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         torch_flash_ma         4.04%     248.485us        39.71%       2.440ms       2.440ms       0.000us         0.00%       4.431ms       4.431ms             1  
                                         torch_flash_ma         0.00%       0.000us         0.00%       0.000us       0.000us       4.380ms       100.24%       4.380ms       4.380ms             1  
                     aten::scaled_dot_product_attention         0.42%      25.679us         2.90%     178.082us      59.361us       0.000us         0.00%       3.552ms       1.184ms             3  
              aten::_scaled_dot_product_flash_attention         0.29%      17.912us         2.48%     152.403us      50.801us       0.000us         0.00%       3.552ms       1.184ms             3  
                         aten::_flash_attention_forward         0.56%      34.360us         1.81%     111.452us      37.151us       3.552ms        81.28%       3.552ms       1.184ms             3  
void pytorch_flash::flash_fwd_kernel<Flash_fwd_kerne...         0.00%       0.000us         0.00%       0.000us       0.000us       3.552ms        81.28%       3.552ms       1.184ms             3  
                                       aten::contiguous         0.17%      10.359us        32.01%       1.967ms     163.915us       0.000us         0.00%     879.392us      73.283us            12  
                                            aten::clone         0.45%      27.371us        31.84%       1.957ms     163.052us       0.000us         0.00%     879.392us      73.283us            12  
                                            aten::copy_         1.33%      81.681us        30.34%       1.864ms     155.367us     818.048us        18.72%     879.392us      73.283us            12  
void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us     818.048us        18.72%     818.048us      68.171us            12  
                                Activity Buffer Request        23.48%       1.443ms        23.48%       1.443ms       1.443ms      61.344us         1.40%      61.344us      61.344us             1  
                                        aten::transpose         0.84%      51.433us         1.14%      69.901us       2.913us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::as_strided         0.30%      18.468us         0.30%      18.468us       0.769us       0.000us         0.00%       0.000us       0.000us            24  
                                       aten::empty_like         0.32%      19.754us         1.37%      83.993us       5.600us       0.000us         0.00%       0.000us       0.000us            15  
                                            aten::empty         1.26%      77.740us         1.26%      77.740us       3.239us       0.000us         0.00%       0.000us       0.000us            24  
                                       cudaLaunchKernel         5.92%     364.005us         5.92%     364.005us      24.267us       0.000us         0.00%       0.000us       0.000us            15  
                                    aten::empty_strided         0.23%      14.381us         0.23%      14.381us       4.794us       0.000us         0.00%       0.000us       0.000us             3  
                                 cudaDeviceGetAttribute         0.03%       1.840us         0.03%       1.840us       0.307us       0.000us         0.00%       0.000us       0.000us             6  
                                   cudaFuncSetAttribute         0.07%       4.180us         0.07%       4.180us       1.393us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        60.29%       3.705ms        60.29%       3.705ms       3.705ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.146ms
Self CUDA time total: 4.370ms


impl                     wl                  p50(ms)  ok
torch_flash_ma           cuda_attn_L128_bfloat16     1.22  True
torch_flash_ma           cuda_attn_L256_bfloat16     1.27  True
torch_flash_ma           cuda_attn_L320_bfloat16     1.29  True
torch_flash_ma           cuda_attn_L384_bfloat16     1.30  True
torch_flash_ma           cuda_attn_L448_bfloat16     1.45  True
torch_flash_ma           cuda_attn_L512_bfloat16     1.49  True
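
Across the six workloads, p50 latency rises from 1.22 ms at L=128 to 1.49 ms at L=512, and in every trace the flash forward kernel accounts for roughly 79-81% of CUDA time while aten::copy_ takes the remaining ~19-21%. Those copies come from the transpose-and-contiguous round trip in torch_flash; if the caller could hand SDPA tensors already in the (batch, heads, seq_len, head_dim) layout it expects, they would vanish. A hypothetical, unbenchmarked variant:

def torch_flash_pre_transposed(q, k, v):
    # Assumes q, k, v arrive as contiguous (batch, heads, seq_len, head_dim),
    # so no transpose or copy is needed before or after the kernel.
    with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.FLASH_ATTENTION):
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)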

Artifacts:

attention.jsonl
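
To sanity-check the p50 column outside the harness, a rough standalone timing sketch; the batch, head, and head-dim sizes below are assumptions, since the workload shapes are not recorded in the output above:

import statistics
import torch

def sdpa_p50_ms(seq_len, batch=1, heads=16, head_dim=64, iters=50):
    # Random bfloat16 inputs in the (batch, heads, seq_len, head_dim) layout SDPA expects.
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                           device="cuda", dtype=torch.bfloat16) for _ in range(3))
    for _ in range(5):  # warmup so lazy init and autotuning don't skew timings
        torch.nn.functional.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        torch.nn.functional.scaled_dot_product_attention(q, k, v)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return statistics.median(times)

print(f"p50 at L=512: {sdpa_p50_ms(512):.2f} ms")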