HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.23s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Thu Oct 30 15:52:16 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   29C    P0             86W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 4.17s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels import get_kernel
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark

# Load the activation kernel from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")


def hf_kernels_swiglu(input_tensor):
    # The last dimension holds the two halves consumed by silu_and_mul,
    # so the output is half as wide as the input.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    # silu_and_mul writes its result into the preallocated `out` buffer.
    return activation.silu_and_mul(out, input_tensor)


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
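
For comparison, the computation that `hf_kernels_swiglu` is expected to perform can be sketched in plain PyTorch. This is a minimal reference, not the kernel's own code: the name `reference_swiglu` is hypothetical, and the layout (gate in the first half of the last dimension, multiplier in the second half) is an assumption based on the usual `silu_and_mul` convention. It runs on CPU, so it can be used to sanity-check outputs without a GPU.

```python
import torch
import torch.nn.functional as F


def reference_swiglu(input_tensor: torch.Tensor) -> torch.Tensor:
    """Plain-PyTorch SwiGLU: silu(gate) * up.

    Assumes `gate` is the first half and `up` the second half of the
    last dimension (hypothetical helper, not part of the benchmark).
    """
    hidden_dim = input_tensor.shape[-1] // 2
    gate = input_tensor[..., :hidden_dim]
    up = input_tensor[..., hidden_dim:]
    return F.silu(gate) * up


# Small CPU check against the formula written out by hand:
# silu(x) = x * sigmoid(x).
x = torch.randn(4, 16)
expected = (x[..., :8] * torch.sigmoid(x[..., :8])) * x[..., 8:]
result = reference_swiglu(x)
print(result.shape)
print(torch.allclose(result, expected, atol=1e-6))
```

On a CUDA device, the same function could be compared against `hf_kernels_swiglu` with `torch.allclose` and a tolerance suited to bfloat16.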
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      78.752us      1953.17%      78.752us      78.752us             1  
                                      hf_kernels_swiglu         9.29%     160.875us        99.59%       1.725ms       1.725ms       0.000us         0.00%       5.440us       5.440us             1  
                      _activation_beeaae6::silu_and_mul         1.15%      19.839us        87.61%       1.518ms     505.995us       4.032us       100.00%       5.440us       1.813us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.032us       100.00%       4.032us       1.344us             3  
                                Activity Buffer Request        83.97%       1.455ms        83.97%       1.455ms       1.455ms       1.408us        34.92%       1.408us       1.408us             1  
                                            aten::empty         2.69%      46.600us         2.69%      46.600us      15.533us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.49%      43.201us         2.49%      43.201us      14.400us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.41%       7.161us         0.41%       7.161us       7.161us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.733ms
Self CUDA time total: 4.032us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      62.528us      1575.81%      62.528us      62.528us             1  
                                      hf_kernels_swiglu         6.86%     110.833us        99.69%       1.610ms       1.610ms       0.000us         0.00%       5.312us       5.312us             1  
                      _activation_beeaae6::silu_and_mul         1.31%      21.159us        91.69%       1.481ms     493.565us       3.968us       100.00%       5.312us       1.771us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us       100.00%       3.968us       1.323us             3  
                                Activity Buffer Request        88.77%       1.434ms        88.77%       1.434ms       1.434ms       1.344us        33.87%       1.344us       1.344us             1  
                                            aten::empty         1.14%      18.330us         1.14%      18.330us       6.110us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.61%      26.001us         1.61%      26.001us       8.667us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.030us         0.31%       5.030us       5.030us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.615ms
Self CUDA time total: 3.968us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.232us      1291.50%      63.232us      63.232us             1  
                                      hf_kernels_swiglu         6.20%     101.121us        99.70%       1.627ms       1.627ms       0.000us         0.00%       6.528us       6.528us             1  
                      _activation_beeaae6::silu_and_mul         1.27%      20.780us        92.37%       1.507ms     502.489us       4.896us       100.00%       6.528us       2.176us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us       100.00%       4.896us       1.632us             3  
                                Activity Buffer Request        89.54%       1.461ms        89.54%       1.461ms       1.461ms       1.632us        33.33%       1.632us       1.632us             1  
                                            aten::empty         1.13%      18.440us         1.13%      18.440us       6.147us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.56%      25.391us         1.56%      25.391us       8.464us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       4.970us         0.30%       4.970us       4.970us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.632ms
Self CUDA time total: 4.896us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.664us      1554.55%      65.664us      65.664us             1  
                                      hf_kernels_swiglu         5.63%     101.442us        99.74%       1.798ms       1.798ms       0.000us         0.00%       5.632us       5.632us             1  
                      _activation_beeaae6::silu_and_mul         1.18%      21.341us        92.99%       1.677ms     558.850us       4.224us       100.00%       5.632us       1.877us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.224us       100.00%       4.224us       1.408us             3  
                                Activity Buffer Request        79.26%       1.429ms        79.26%       1.429ms       1.429ms       1.408us        33.33%       1.408us       1.408us             1  
                                            aten::empty         1.12%      20.239us         1.12%      20.239us       6.746us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        12.54%     226.164us        12.54%     226.164us      75.388us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.26%       4.649us         0.26%       4.649us       4.649us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.803ms
Self CUDA time total: 4.224us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.968us      1086.23%      63.968us      63.968us             1  
                                      hf_kernels_swiglu        19.44%      85.062us        98.79%     432.257us     432.257us       0.000us         0.00%       7.874us       7.874us             1  
                      _activation_beeaae6::silu_and_mul         4.74%      20.731us        74.99%     328.126us     109.375us       5.889us       100.00%       7.874us       2.625us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.889us       100.00%       5.889us       1.963us             3  
                                Activity Buffer Request        29.32%     128.302us        29.32%     128.302us     128.302us       1.985us        33.71%       1.985us       1.985us             1  
                                            aten::empty         4.36%      19.069us         4.36%      19.069us       6.356us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        40.93%     179.093us        40.93%     179.093us      59.698us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.21%       5.289us         1.21%       5.289us       5.289us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 437.546us
Self CUDA time total: 5.889us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.167us       867.45%      67.167us      67.167us             1  
                                      hf_kernels_swiglu         5.97%     103.951us        99.66%       1.736ms       1.736ms       0.000us         0.00%      10.335us      10.335us             1  
                      _activation_beeaae6::silu_and_mul         1.17%      20.451us        92.57%       1.612ms     537.363us       7.743us       100.00%      10.335us       3.445us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.743us       100.00%       7.743us       2.581us             3  
                                Activity Buffer Request        82.03%       1.429ms        82.03%       1.429ms       1.429ms       2.592us        33.48%       2.592us       2.592us             1  
                                            aten::empty         1.12%      19.510us         1.12%      19.510us       6.503us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.36%     162.983us         9.36%     162.983us      54.328us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.34%       5.970us         0.34%       5.970us       5.970us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.742ms
Self CUDA time total: 7.743us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.999us      1036.41%      67.999us      67.999us             1  
                                      hf_kernels_swiglu         5.88%     101.172us        99.74%       1.716ms       1.716ms       0.000us         0.00%       8.769us       8.769us             1  
                      _activation_beeaae6::silu_and_mul         1.20%      20.670us        92.73%       1.596ms     531.873us       6.561us       100.00%       8.769us       2.923us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.561us       100.00%       6.561us       2.187us             3  
                                Activity Buffer Request        82.56%       1.421ms        82.56%       1.421ms       1.421ms       2.208us        33.65%       2.208us       2.208us             1  
                                            aten::empty         1.13%      19.490us         1.13%      19.490us       6.497us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.96%     154.233us         8.96%     154.233us      51.411us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.26%       4.490us         0.26%       4.490us       4.490us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.721ms
Self CUDA time total: 6.561us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.295us       670.43%      63.295us      63.295us             1  
                                      hf_kernels_swiglu        23.24%      86.211us        98.67%     366.026us     366.026us       0.000us         0.00%      12.609us      12.609us             1  
                      _activation_beeaae6::silu_and_mul         5.71%      21.191us        70.40%     261.155us      87.052us       9.441us       100.00%      12.609us       4.203us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.441us       100.00%       9.441us       3.147us             3  
                                Activity Buffer Request        23.85%      88.481us        23.85%      88.481us      88.481us       3.168us        33.56%       3.168us       3.168us             1  
                                            aten::empty         5.03%      18.660us         5.03%      18.660us       6.220us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        40.84%     151.483us        40.84%     151.483us      50.494us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.33%       4.920us         1.33%       4.920us       4.920us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 370.946us
Self CUDA time total: 9.441us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.342us       500.47%      65.342us      65.342us             1  
                                      hf_kernels_swiglu        22.94%      96.471us        98.88%     415.727us     415.727us       0.000us         0.00%      17.408us      17.408us             1  
                      _activation_beeaae6::silu_and_mul         5.11%      21.490us        71.29%     299.725us      99.908us      13.056us       100.00%      17.408us       5.803us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.056us       100.00%      13.056us       4.352us             3  
                                Activity Buffer Request        30.59%     128.632us        30.59%     128.632us     128.632us       4.352us        33.33%       4.352us       4.352us             1  
                                            aten::empty         4.65%      19.531us         4.65%      19.531us       6.510us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.58%     149.603us        35.58%     149.603us      49.868us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.12%       4.720us         1.12%       4.720us       4.720us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 420.447us
Self CUDA time total: 13.056us


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
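
The p50 column is nearly flat across workloads, while the per-call kernel time in the profile traces grows roughly threefold from the smallest to the largest shape. A rough way to see how much of the end-to-end latency the kernel itself accounts for is sketched below. The numbers are copied from the output above; the helper `kernel_share` is hypothetical, and the comparison is only approximate because p50 is rounded to two decimals and the traces were captured under the profiler rather than in the timed runs.

```python
# Per-call kernel time (us), copied from the "CUDA time avg" column of the
# act_and_mul_kernel row in each profile trace.
kernel_avg_us = {
    "cuda_T128_D768": 1.344,
    "cuda_T128_D1024": 1.323,
    "cuda_T128_D2048": 1.632,
    "cuda_T256_D768": 1.408,
    "cuda_T256_D1024": 1.963,
    "cuda_T256_D2048": 2.581,
    "cuda_T512_D768": 2.187,
    "cuda_T512_D1024": 3.147,
    "cuda_T512_D2048": 4.352,
}

# p50 latency (ms), copied from the summary table.
p50_ms = {wl: 0.03 for wl in kernel_avg_us}
p50_ms["cuda_T128_D768"] = 0.02


def kernel_share(workload: str) -> float:
    """Fraction of the reported p50 latency spent inside the kernel."""
    return kernel_avg_us[workload] / (p50_ms[workload] * 1000.0)


for wl in kernel_avg_us:
    print(
        f"{wl:<18} kernel={kernel_avg_us[wl]:.3f}us  "
        f"p50={p50_ms[wl] * 1000:.0f}us  share={kernel_share(wl):.1%}"
    )
```

Under these assumptions the kernel accounts for well under a fifth of the reported latency at every shape, which suggests that fixed per-call overhead (launch, allocation, Python dispatch) dominates the p50 figures at these sizes.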

Artifacts:

activation.jsonl