Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.26s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Thu Oct 30 15:52:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   33C    P0            139W /  350W |       0MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

LayerNorm Benchmark (PyTorch)
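
For orientation, the operation being benchmarked normalizes each row over its last dimension and then applies an elementwise scale and shift. The block below is a minimal pure-Python sketch of that definition for a single row. The helper name `layer_norm_row` is hypothetical and is not part of PyTorch or the benchmark tooling; it only illustrates the arithmetic and makes no claim about how the CUDA kernel is implemented.

```python
import math


def layer_norm_row(row, weight, bias, eps=1e-5):
    """Reference LayerNorm for one 1-D row given as plain Python lists."""
    n = len(row)
    mean = sum(row) / n
    # Biased variance (divide by n), as in the standard LayerNorm definition.
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Normalize, then apply the per-feature scale (weight) and shift (bias).
    return [(v - mean) * inv_std * w + b for v, w, b in zip(row, weight, bias)]


row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm_row(row, weight=[1.0] * 4, bias=[0.0] * 4)
print([round(v, 4) for v in out])
```

With unit weight and zero bias, the output has mean close to zero and variance close to one, which is a quick sanity check for any LayerNorm implementation.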

Cell: benchmark | 7.42s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only, using PyTorch's built-in layer_norm.
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)
Running layer_norm benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.91%     153.364us        46.27%       1.815ms       1.815ms       0.000us         0.00%       3.039ms       3.039ms             1  
                                       aten::layer_norm         0.42%      16.299us        42.36%       1.661ms     553.716us       0.000us         0.00%       3.039ms       1.013ms             3  
                                aten::native_layer_norm         2.01%      79.002us        41.94%       1.645ms     548.283us       2.327ms       100.00%       3.039ms       1.013ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.329ms       100.06%       2.329ms       2.329ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.327ms       100.00%       2.327ms     775.829us             3  
                                Activity Buffer Request        37.33%       1.464ms        37.33%       1.464ms       1.464ms     711.872us        30.59%     711.872us     711.872us             1  
                                            aten::empty         1.19%      46.781us         1.19%      46.781us       5.198us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.21%      47.400us         1.21%      47.400us      15.800us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.20%       7.811us         0.20%       7.811us       1.302us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        53.73%       2.107ms        53.73%       2.107ms       2.107ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.922ms
Self CUDA time total: 2.327ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.15%      73.661us        25.36%       1.626ms       1.626ms       0.000us         0.00%       6.533ms       6.533ms             1  
                                       aten::layer_norm         0.14%       8.791us        24.21%       1.552ms     517.499us       0.000us         0.00%       6.533ms       2.178ms             3  
                                aten::native_layer_norm         0.79%      50.951us        24.07%       1.544ms     514.569us       4.920ms       100.00%       6.533ms       2.178ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.922ms       100.03%       4.922ms       4.922ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.920ms       100.00%       4.920ms       1.640ms             3  
                                Activity Buffer Request        22.34%       1.433ms        22.34%       1.433ms       1.433ms       1.613ms        32.78%       1.613ms       1.613ms             1  
                                            aten::empty         0.45%      28.941us         0.45%      28.941us       3.216us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.43%      27.430us         0.43%      27.430us       9.143us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       3.590us         0.06%       3.590us       0.598us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        74.64%       4.787ms        74.64%       4.787ms       4.787ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.413ms
Self CUDA time total: 4.920ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.10%      68.311us        26.09%       1.619ms       1.619ms       0.000us         0.00%       6.232ms       6.232ms             1  
                                       aten::layer_norm         0.13%       8.220us        24.99%       1.551ms     516.952us       0.000us         0.00%       6.232ms       2.077ms             3  
                                aten::native_layer_norm         0.83%      51.401us        24.86%       1.543ms     514.212us       4.714ms       100.00%       6.232ms       2.077ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.716ms       100.03%       4.716ms       4.716ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.714ms       100.00%       4.714ms       1.571ms             3  
                                Activity Buffer Request        23.07%       1.432ms        23.07%       1.432ms       1.432ms       1.518ms        32.20%       1.518ms       1.518ms             1  
                                            aten::empty         0.45%      27.641us         0.45%      27.641us       3.071us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.45%      27.961us         0.45%      27.961us       9.320us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       3.720us         0.06%       3.720us       0.620us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        73.91%       4.587ms        73.91%       4.587ms       4.587ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.206ms
Self CUDA time total: 4.714ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.61%      68.882us        14.40%       1.628ms       1.628ms       0.000us         0.00%      13.066ms      13.066ms             1  
                                       aten::layer_norm         0.08%       8.939us        13.79%       1.559ms     519.662us       0.000us         0.00%      13.066ms       4.355ms             3  
                                aten::native_layer_norm         0.44%      49.281us        13.71%       1.550ms     516.682us       9.830ms       100.00%      13.066ms       4.355ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       9.831ms       100.01%       9.831ms       9.831ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.830ms       100.00%       9.830ms       3.277ms             3  
                                Activity Buffer Request        11.27%       1.275ms        11.27%       1.275ms       1.275ms       3.236ms        32.92%       3.236ms       3.236ms             1  
                                            aten::empty         0.25%      28.400us         0.25%      28.400us       3.156us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.71%     193.833us         1.71%     193.833us      64.611us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.03%       3.811us         0.03%       3.811us       0.635us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        85.60%       9.678ms        85.60%       9.678ms       9.678ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 11.306ms
Self CUDA time total: 9.830ms


impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S2048_D4096     0.82  True
torch_layer_norm         LN_B16_S2048_D8192     1.68  True
torch_layer_norm         LN_B16_S4096_D4096     1.61  True
torch_layer_norm         LN_B16_S4096_D8192     3.32  True
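
One way to read the p50 figures above is as effective memory bandwidth. The sketch below is a rough estimate under stated assumptions, not a measurement: it assumes the workload names encode batch, sequence length, and hidden size as `B`, `S`, and `D`; it assumes 2 bytes per element (fp16 or bf16), since the report does not state the dtype; and it counts only one read of the input and one write of the output, ignoring the much smaller weight and bias tensors. The helper names are hypothetical.

```python
import re

# p50 latencies in milliseconds, copied from the summary table.
P50_MS = {
    "LN_B16_S2048_D4096": 0.82,
    "LN_B16_S2048_D8192": 1.68,
    "LN_B16_S4096_D4096": 1.61,
    "LN_B16_S4096_D8192": 3.32,
}

# ASSUMPTION: 2 bytes per element (fp16/bf16); the dtype is not given in the report.
BYTES_PER_ELEMENT = 2


def parse_workload(name):
    """Read (B, S, D) from a name such as 'LN_B16_S2048_D4096' (assumed naming scheme)."""
    match = re.fullmatch(r"LN_B(\d+)_S(\d+)_D(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized workload name: {name}")
    return tuple(int(group) for group in match.groups())


def effective_bandwidth_gbs(name, p50_ms, bytes_per_element=BYTES_PER_ELEMENT):
    """Estimate GB/s, counting one read of the input and one write of the output."""
    b, s, d = parse_workload(name)
    bytes_moved = 2 * b * s * d * bytes_per_element
    return bytes_moved / (p50_ms * 1e-3) / 1e9


for name, ms in P50_MS.items():
    print(f"{name}: {effective_bandwidth_gbs(name, ms):.0f} GB/s")
```

If the dtype assumption holds, the four workloads land at a similar estimated bandwidth, which is consistent with latency scaling roughly linearly in the number of elements.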

Artifacts:

layer_norm.jsonl
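
The artifact is a JSON Lines file, so it can be inspected with the standard library alone. The sketch below is a generic reader; the record schema is not shown in this report, so the code only parses each line and makes no assumption about field names. The function name `read_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    artifact = Path("layer_norm.jsonl")
    # Only print when the artifact is present in the working directory.
    if artifact.exists():
        for record in read_jsonl(artifact):
            print(record)
```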