HF Kernels - Causal Conv1D

GPU Info

Cell: nv | 0.28s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Thu Oct 30 15:51:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   27C    P8             22W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark
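
For orientation, here is a minimal pure-Python sketch of what a causal depthwise 1D convolution computes. It is not the kernel's implementation. It assumes the input layout is (batch, dim, seqlen) and the weight layout is (dim, width), which is a guess based on the B/D/S/W fields in the workload names below, and it assumes the last filter tap multiplies the current sample (the usual cross-correlation convention with left zero-padding). Any optional activation is left out. The function names are hypothetical.

```python
def causal_conv1d_ref(x, weight, bias=0.0):
    """Causal 1D convolution for a single channel, in plain Python.

    x:      list of S input samples
    weight: list of W filter taps; weight[-1] multiplies the current sample
            (assumed convention)
    bias:   scalar added to every output
    """
    width = len(weight)
    out = []
    for t in range(len(x)):
        acc = bias
        for k in range(width):
            src = t - (width - 1) + k  # negative index means left zero-padding
            if src >= 0:
                acc += weight[k] * x[src]
        out.append(acc)
    return out


def causal_conv1d_ref_batched(x, weight, bias):
    """Apply the single-channel reference independently per channel (depthwise).

    x:      nested list shaped [batch][dim][seqlen] (assumed layout)
    weight: nested list shaped [dim][width]
    bias:   list shaped [dim]
    """
    return [
        [causal_conv1d_ref(x[b][d], weight[d], bias[d]) for d in range(len(weight))]
        for b in range(len(x))
    ]


x = [[[1.0, 2.0, 3.0, 4.0]]]  # batch=1, dim=1, seqlen=4
w = [[0.5, 2.0]]              # dim=1, width=2
print(causal_conv1d_ref_batched(x, w, [0.0]))  # [[[2.0, 4.5, 7.0, 9.5]]]
```

Because output t only reads inputs at positions up to t, changing a later sample never changes an earlier output. That is the "causal" property the benchmarked kernel is named for.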

Cell: benchmark | 5.66s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the causal conv1d kernel
causal_conv1d = get_kernel("kernels-community/causal-conv1d")


def hf_kernels_causal_conv1d(input_tensor, weight, bias):
    # Thin wrapper: forward the benchmark inputs to the loaded kernel's causal_conv1d_fn.
    return causal_conv1d.causal_conv1d_fn(input_tensor, weight, bias)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="hf_kernels_causal_conv1d",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_causal_conv1d,
)
Running causal_conv1d benchmark on cuda with 24 workloads.

======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     148.031us      3643.39%     148.031us     148.031us             1  
                               hf_kernels_causal_conv1d         8.90%     165.322us        99.57%       1.851ms       1.851ms       0.000us         0.00%       5.503us       5.503us             1  
                                         CausalConv1dFn         5.85%     108.724us        90.68%       1.685ms     561.740us       0.000us         0.00%       5.503us       1.834us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.35%      25.159us        81.18%       1.509ms     502.865us       4.063us       100.00%       5.503us       1.834us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        77.32%       1.437ms        77.32%       1.437ms       1.437ms       1.440us        35.44%       1.440us       1.440us             1  
                                       aten::empty_like         0.95%      17.630us         3.65%      67.900us      22.633us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.70%      50.270us         2.70%      50.270us      16.757us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.50%      46.532us         2.50%      46.532us      15.511us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.43%       7.900us         0.43%       7.900us       7.900us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.858ms
Self CUDA time total: 4.063us
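
One way to read a trace like the one above, sketched with numbers copied from it. This is an interpretation and nothing here comes from the benchmark tool itself. It assumes the workload name encodes batch, dim, seqlen and width, and `parse_workload` is a hypothetical helper. The arithmetic shows where the per-launch average, the "CUDA total" of the op row, and the oversized "Self CUDA %" of the top annotation row appear to come from.

```python
import re


def parse_workload(name):
    """Split a workload label such as 'cuda_B2_D64_S128_W2' into fields.

    The field meanings (batch, dim, seqlen, width) are an assumption.
    """
    m = re.fullmatch(
        r"(?P<device>[a-z]+)_B(?P<batch>\d+)_D(?P<dim>\d+)"
        r"_S(?P<seqlen>\d+)_W(?P<width>\d+)",
        name,
    )
    if m is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    fields = m.groupdict()
    return {k: (v if k == "device" else int(v)) for k, v in fields.items()}


# Values copied from the cuda_B2_D64_S128_W2 trace (microseconds).
kernel_self_cuda_us = 4.063    # causal_conv1d_fwd_kernel, Self CUDA over 3 calls
kernel_calls = 3
buffer_request_us = 1.440      # Activity Buffer Request, Self CUDA
annotation_cuda_us = 148.031   # top hf_kernels_causal_conv1d row, Self CUDA
self_cuda_total_us = 4.063     # "Self CUDA time total" footer

# Average GPU time per kernel launch (table shows 1.354us).
avg_per_launch_us = kernel_self_cuda_us / kernel_calls

# The op row's "CUDA total" matches kernel time plus the buffer request (5.503us).
op_cuda_total_us = kernel_self_cuda_us + buffer_request_us

# The annotation row is divided by the kernel-only total, so it exceeds 100%.
annotation_pct = 100 * annotation_cuda_us / self_cuda_total_us

print(parse_workload("cuda_B2_D64_S128_W2"))
print(round(avg_per_launch_us, 3), round(op_cuda_total_us, 3), round(annotation_pct, 2))
```

The CPU-side totals in these traces are dominated by the one-off "Activity Buffer Request" row, so the per-launch GPU average is probably the more useful figure when comparing workloads.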



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.926us      3229.86%     120.926us     120.926us             1  
                               hf_kernels_causal_conv1d         5.72%      96.561us        99.68%       1.683ms       1.683ms       0.000us         0.00%       4.992us       4.992us             1  
                                         CausalConv1dFn         4.27%      72.072us        93.97%       1.587ms     528.936us       0.000us         0.00%       4.992us       1.664us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.50%      25.350us        87.84%       1.483ms     494.459us       3.744us       100.00%       4.992us       1.664us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.744us       100.00%       3.744us       1.248us             3  
                                Activity Buffer Request        84.49%       1.427ms        84.49%       1.427ms       1.427ms       1.248us        33.33%       1.248us       1.248us             1  
                                       aten::empty_like         0.48%       8.160us         1.86%      31.360us      10.453us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.37%      23.200us         1.37%      23.200us       7.733us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.85%      31.292us         1.85%      31.292us      10.431us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.32%       5.320us         0.32%       5.320us       5.320us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.689ms
Self CUDA time total: 3.744us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.942us      3255.88%     122.942us     122.942us             1  
                               hf_kernels_causal_conv1d         6.02%     102.400us        99.66%       1.696ms       1.696ms       0.000us         0.00%       5.023us       5.023us             1  
                                         CausalConv1dFn         4.37%      74.304us        93.64%       1.594ms     531.323us       0.000us         0.00%       5.023us       1.674us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.51%      25.778us        87.51%       1.490ms     496.532us       3.776us       100.00%       5.023us       1.674us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.776us       100.00%       3.776us       1.259us             3  
                                Activity Buffer Request        84.19%       1.433ms        84.19%       1.433ms       1.433ms       1.247us        33.02%       1.247us       1.247us             1  
                                       aten::empty_like         0.48%       8.219us         1.77%      30.070us      10.023us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.28%      21.851us         1.28%      21.851us       7.284us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.81%      30.742us         1.81%      30.742us      10.247us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.34%       5.821us         0.34%       5.821us       5.821us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.702ms
Self CUDA time total: 3.776us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     154.975us      4105.30%     154.975us     154.975us             1  
                               hf_kernels_causal_conv1d         5.10%      97.113us        99.71%       1.897ms       1.897ms       0.000us         0.00%       5.022us       5.022us             1  
                                         CausalConv1dFn         5.06%      96.320us        94.60%       1.800ms     599.880us       0.000us         0.00%       5.022us       1.674us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.32%      25.153us        87.78%       1.670ms     556.640us       3.775us       100.00%       5.022us       1.674us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.775us       100.00%       3.775us       1.258us             3  
                                Activity Buffer Request        75.43%       1.435ms        75.43%       1.435ms       1.435ms       1.247us        33.03%       1.247us       1.247us             1  
                                       aten::empty_like         0.48%       9.119us         1.76%      33.400us      11.133us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.28%      24.281us         1.28%      24.281us       8.094us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        11.03%     209.783us        11.03%     209.783us      69.928us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.29%       5.600us         0.29%       5.600us       5.600us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.902ms
Self CUDA time total: 3.775us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     127.520us      2656.67%     127.520us     127.520us             1  
                               hf_kernels_causal_conv1d         5.48%     101.023us        99.67%       1.838ms       1.838ms       0.000us         0.00%       6.400us       6.400us             1  
                                         CausalConv1dFn         4.02%      74.081us        94.20%       1.737ms     579.070us       0.000us         0.00%       6.400us       2.133us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.41%      25.982us        88.51%       1.632ms     544.113us       4.800us       100.00%       6.400us       2.133us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.800us       100.00%       4.800us       1.600us             3  
                                Activity Buffer Request        78.02%       1.439ms        78.02%       1.439ms       1.439ms       1.600us        33.33%       1.600us       1.600us             1  
                                       aten::empty_like         0.45%       8.310us         1.67%      30.790us      10.263us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.22%      22.480us         1.22%      22.480us       7.493us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.08%     167.462us         9.08%     167.462us      55.821us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.33%       6.020us         0.33%       6.020us       6.020us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.844ms
Self CUDA time total: 4.800us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.208us      2446.36%     118.208us     118.208us             1  
                               hf_kernels_causal_conv1d        14.10%      77.840us        98.97%     546.449us     546.449us       0.000us         0.00%       6.464us       6.464us             1  
                                         CausalConv1dFn        13.03%      71.942us        84.87%     468.609us     156.203us       0.000us         0.00%       6.464us       2.155us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.50%      24.830us        66.59%     367.636us     122.545us       4.832us       100.00%       6.464us       2.155us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.832us       100.00%       4.832us       1.611us             3  
                                Activity Buffer Request        33.64%     185.743us        33.64%     185.743us     185.743us       1.632us        33.77%       1.632us       1.632us             1  
                                       aten::empty_like         1.44%       7.931us         5.26%      29.031us       9.677us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         3.82%      21.100us         3.82%      21.100us       7.033us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        28.45%     157.063us        28.45%     157.063us      52.354us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.03%       5.680us         1.03%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 552.129us
Self CUDA time total: 4.832us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.887us      1226.27%     129.887us     129.887us             1  
                               hf_kernels_causal_conv1d         5.23%      95.772us        99.69%       1.826ms       1.826ms       0.000us         0.00%      14.144us      14.144us             1  
                                         CausalConv1dFn         4.13%      75.612us        94.46%       1.730ms     576.726us       0.000us         0.00%      14.144us       4.715us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.41%      25.780us        88.71%       1.625ms     541.586us      10.592us       100.00%      14.144us       4.715us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.592us       100.00%      10.592us       3.531us             3  
                                Activity Buffer Request        78.55%       1.439ms        78.55%       1.439ms       1.439ms       3.552us        33.53%       3.552us       3.552us             1  
                                       aten::empty_like         0.48%       8.780us         1.63%      29.810us       9.937us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.15%      21.030us         1.15%      21.030us       7.010us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.75%     160.332us         8.75%     160.332us      53.444us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.650us         0.31%       5.650us       5.650us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.832ms
Self CUDA time total: 10.592us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     119.356us      1093.80%     119.356us     119.356us             1  
                               hf_kernels_causal_conv1d        19.79%      94.221us        98.72%     469.928us     469.928us       0.000us         0.00%      14.592us      14.592us             1  
                                         CausalConv1dFn        14.74%      70.172us        78.93%     375.707us     125.236us       0.000us         0.00%      14.592us       4.864us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.30%      25.240us        58.06%     276.375us      92.125us      10.912us       100.00%      14.592us       4.864us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.912us       100.00%      10.912us       3.637us             3  
                                Activity Buffer Request        19.79%      94.192us        19.79%      94.192us      94.192us       3.680us        33.72%       3.680us       3.680us             1  
                                       aten::empty_like         1.68%       7.980us         6.13%      29.160us       9.720us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.45%      21.180us         4.45%      21.180us       7.060us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        32.97%     156.943us        32.97%     156.943us      52.314us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.28%       6.090us         1.28%       6.090us       6.090us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 476.018us
Self CUDA time total: 10.912us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.375us      1178.71%     129.375us     129.375us             1  
                               hf_kernels_causal_conv1d         5.38%      99.351us        99.70%       1.840ms       1.840ms       0.000us         0.00%      14.656us      14.656us             1  
                                         CausalConv1dFn         4.01%      73.942us        94.32%       1.740ms     580.087us       0.000us         0.00%      14.656us       4.885us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.38%      25.552us        88.67%       1.636ms     545.346us      10.976us       100.00%      14.656us       4.885us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.976us       100.00%      10.976us       3.659us             3  
                                Activity Buffer Request        78.64%       1.451ms        78.64%       1.451ms       1.451ms       3.680us        33.53%       3.680us       3.680us             1  
                                       aten::empty_like         0.48%       8.800us         1.64%      30.280us      10.093us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.16%      21.480us         1.16%      21.480us       7.160us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.64%     159.392us         8.64%     159.392us      53.131us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.531us         0.30%       5.531us       5.531us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.845ms
Self CUDA time total: 10.976us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.679us      1104.47%     123.679us     123.679us             1  
                               hf_kernels_causal_conv1d        17.75%      87.860us        98.92%     489.618us     489.618us       0.000us         0.00%      14.974us      14.974us             1  
                                         CausalConv1dFn        14.77%      73.091us        81.17%     401.758us     133.919us       0.000us         0.00%      14.974us       4.991us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.42%      26.830us        60.45%     299.195us      99.732us      11.198us       100.00%      14.974us       4.991us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.198us       100.00%      11.198us       3.733us             3  
                                Activity Buffer Request        20.28%     100.392us        20.28%     100.392us     100.392us       3.776us        33.72%       3.776us       3.776us             1  
                                       aten::empty_like         1.69%       8.381us         5.95%      29.472us       9.824us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.26%      21.091us         4.26%      21.091us       7.030us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.75%     171.973us        34.75%     171.973us      57.324us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.08%       5.331us         1.08%       5.331us       5.331us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 494.949us
Self CUDA time total: 11.198us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     132.959us       264.31%     132.959us     132.959us             1  
                               hf_kernels_causal_conv1d         5.33%      97.801us        99.71%       1.830ms       1.830ms       0.000us         0.00%      83.968us      83.968us             1  
                                         CausalConv1dFn         4.03%      73.903us        94.38%       1.732ms     577.264us       0.000us         0.00%      83.968us      27.989us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.44%      26.339us        88.71%       1.628ms     542.606us      50.304us       100.00%      83.968us      27.989us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      50.304us       100.00%      50.304us      16.768us             3  
                                Activity Buffer Request        78.52%       1.441ms        78.52%       1.441ms       1.441ms      33.664us        66.92%      33.664us      33.664us             1  
                                       aten::empty_like         0.46%       8.510us         1.64%      30.070us      10.023us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.17%      21.560us         1.17%      21.560us       7.187us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.75%     160.594us         8.75%     160.594us      53.531us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.29%       5.400us         0.29%       5.400us       5.400us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.835ms
Self CUDA time total: 50.304us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.085us       244.46%     125.085us     125.085us             1  
                               hf_kernels_causal_conv1d        15.91%      74.080us        98.78%     459.898us     459.898us       0.000us         0.00%      85.694us      85.694us             1  
                                         CausalConv1dFn        15.58%      72.521us        82.87%     385.818us     128.606us       0.000us         0.00%      85.694us      28.565us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.92%      27.572us        61.05%     284.236us      94.745us      51.167us       100.00%      85.694us      28.565us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      51.167us       100.00%      51.167us      17.056us             3  
                                Activity Buffer Request        21.78%     101.412us        21.78%     101.412us     101.412us      34.527us        67.48%      34.527us      34.527us             1  
                                       aten::empty_like         1.68%       7.830us         6.24%      29.061us       9.687us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.56%      21.231us         4.56%      21.231us       7.077us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.35%     155.252us        33.35%     155.252us      51.751us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.22%       5.680us         1.22%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 465.578us
Self CUDA time total: 51.167us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.583us      3164.74%     123.583us     123.583us             1  
                               hf_kernels_causal_conv1d         8.70%      75.560us        99.36%     863.215us     863.215us       0.000us         0.00%       5.153us       5.153us             1  
                                         CausalConv1dFn         8.33%      72.353us        90.66%     787.655us     262.552us       0.000us         0.00%       5.153us       1.718us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         2.88%      25.000us        78.85%     685.062us     228.354us       3.905us       100.00%       5.153us       1.718us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.905us       100.00%       3.905us       1.302us             3  
                                Activity Buffer Request        57.61%     500.499us        57.61%     500.499us     500.499us       1.248us        31.96%       1.248us       1.248us             1  
                                       aten::empty_like         0.96%       8.370us         3.48%      30.240us      10.080us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.52%      21.870us         2.52%      21.870us       7.290us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        18.37%     159.563us        18.37%     159.563us      53.188us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.64%       5.560us         0.64%       5.560us       5.560us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 868.775us
Self CUDA time total: 3.905us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.845us      3044.19%     118.845us     118.845us             1  
                               hf_kernels_causal_conv1d        16.55%      74.260us        98.76%     443.077us     443.077us       0.000us         0.00%       5.152us       5.152us             1  
                                         CausalConv1dFn        15.87%      71.182us        82.21%     368.817us     122.939us       0.000us         0.00%       5.152us       1.717us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.48%      24.591us        59.34%     266.204us      88.735us       3.904us       100.00%       5.152us       1.717us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.904us       100.00%       3.904us       1.301us             3  
                                Activity Buffer Request        18.72%      83.961us        18.72%      83.961us      83.961us       1.248us        31.97%       1.248us       1.248us             1  
                                       aten::empty_like         1.83%       8.189us         7.01%      31.431us      10.477us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         5.18%      23.242us         5.18%      23.242us       7.747us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.14%     157.652us        35.14%     157.652us      52.551us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.24%       5.551us         1.24%       5.551us       5.551us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 448.628us
Self CUDA time total: 3.904us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.816us      3046.03%     122.816us     122.816us             1  
                               hf_kernels_causal_conv1d         8.66%      75.390us        99.38%     865.505us     865.505us       0.000us         0.00%       5.376us       5.376us             1  
                                         CausalConv1dFn         8.40%      73.201us        90.72%     790.115us     263.372us       0.000us         0.00%       5.376us       1.792us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.02%      26.261us        78.90%     687.193us     229.064us       4.032us       100.00%       5.376us       1.792us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.032us       100.00%       4.032us       1.344us             3  
                                Activity Buffer Request        57.07%     497.089us        57.07%     497.089us     497.089us       1.344us        33.33%       1.344us       1.344us             1  
                                       aten::empty_like         0.93%       8.130us         3.41%      29.721us       9.907us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.48%      21.591us         2.48%      21.591us       7.197us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        18.81%     163.843us        18.81%     163.843us      54.614us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.62%       5.440us         0.62%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 870.945us
Self CUDA time total: 4.032us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     116.446us      2866.01%     116.446us     116.446us             1  
                               hf_kernels_causal_conv1d        16.24%      74.671us        98.84%     454.378us     454.378us       0.000us         0.00%       5.407us       5.407us             1  
                                         CausalConv1dFn        15.28%      70.221us        82.60%     379.707us     126.569us       0.000us         0.00%       5.407us       1.802us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.99%      27.540us        61.00%     280.405us      93.468us       4.063us       100.00%       5.407us       1.802us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        21.14%      97.192us        21.14%      97.192us      97.192us       1.344us        33.08%       1.344us       1.344us             1  
                                       aten::empty_like         1.73%       7.931us         6.33%      29.081us       9.694us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.60%      21.150us         4.60%      21.150us       7.050us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.86%     155.673us        33.86%     155.673us      51.891us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.16%       5.330us         1.16%       5.330us       5.330us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 459.708us
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     120.895us      2262.26%     120.895us     120.895us             1  
                               hf_kernels_causal_conv1d        10.03%      75.040us        99.26%     742.432us     742.432us       0.000us         0.00%       7.136us       7.136us             1  
                                         CausalConv1dFn         9.57%      71.601us        89.23%     667.392us     222.464us       0.000us         0.00%       7.136us       2.379us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.57%      26.722us        75.60%     565.480us     188.493us       5.344us       100.00%       7.136us       2.379us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.344us       100.00%       5.344us       1.781us             3  
                                Activity Buffer Request        50.95%     381.056us        50.95%     381.056us     381.056us       1.792us        33.53%       1.792us       1.792us             1  
                                       aten::empty_like         1.09%       8.161us         4.05%      30.311us      10.104us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.96%      22.150us         2.96%      22.150us       7.383us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        21.08%     157.702us        21.08%     157.702us      52.567us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.74%       5.510us         0.74%       5.510us       5.510us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 747.942us
Self CUDA time total: 5.344us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     114.428us      2091.54%     114.428us     114.428us             1  
                               hf_kernels_causal_conv1d        15.93%      72.612us        98.81%     450.477us     450.477us       0.000us         0.00%       7.327us       7.327us             1  
                                         CausalConv1dFn        15.28%      69.671us        82.88%     377.865us     125.955us       0.000us         0.00%       7.327us       2.442us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.81%      26.480us        61.42%     279.994us      93.331us       5.471us       100.00%       7.327us       2.442us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.471us       100.00%       5.471us       1.824us             3  
                                Activity Buffer Request        21.45%      97.772us        21.45%      97.772us      97.772us       1.856us        33.92%       1.856us       1.856us             1  
                                       aten::empty_like         1.75%       7.980us         6.19%      28.200us       9.400us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.44%      20.220us         4.44%      20.220us       6.740us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.16%     155.742us        34.16%     155.742us      51.914us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.19%       5.420us         1.19%       5.420us       5.420us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 455.897us
Self CUDA time total: 5.471us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.251us       717.80%     124.251us     124.251us             1  
                               hf_kernels_causal_conv1d        10.05%      75.520us        99.24%     745.563us     745.563us       0.000us         0.00%      23.101us      23.101us             1  
                                         CausalConv1dFn         9.33%      70.111us        89.19%     670.043us     223.348us       0.000us         0.00%      23.101us       7.700us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.43%      25.770us        75.92%     570.342us     190.114us      17.310us       100.00%      23.101us       7.700us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.310us       100.00%      17.310us       5.770us             3  
                                Activity Buffer Request        51.18%     384.497us        51.18%     384.497us     384.497us       5.791us        33.45%       5.791us       5.791us             1  
                                       aten::empty_like         1.14%       8.540us         3.94%      29.590us       9.863us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.80%      21.050us         2.80%      21.050us       7.017us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        21.31%     160.075us        21.31%     160.075us      53.358us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.76%       5.680us         0.76%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 751.243us
Self CUDA time total: 17.310us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     121.596us       682.20%     121.596us     121.596us             1  
                               hf_kernels_causal_conv1d        16.81%      75.551us        98.76%     443.797us     443.797us       0.000us         0.00%      23.808us      23.808us             1  
                                         CausalConv1dFn        15.22%      68.400us        81.95%     368.246us     122.749us       0.000us         0.00%      23.808us       7.936us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.83%      26.181us        60.07%     269.934us      89.978us      17.824us       100.00%      23.808us       7.936us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.824us       100.00%      17.824us       5.941us             3  
                                Activity Buffer Request        19.24%      86.441us        19.24%      86.441us      86.441us       5.984us        33.57%       5.984us       5.984us             1  
                                       aten::empty_like         1.76%       7.900us         6.66%      29.912us       9.971us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.90%      22.012us         4.90%      22.012us       7.337us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.01%     157.312us        35.01%     157.312us      52.437us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.24%       5.550us         1.24%       5.550us       5.550us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 449.347us
Self CUDA time total: 17.824us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.077us       686.13%     122.077us     122.077us             1  
                               hf_kernels_causal_conv1d        12.00%      91.181us        99.29%     754.243us     754.243us       0.000us         0.00%      23.808us      23.808us             1  
                                         CausalConv1dFn         9.45%      71.802us        87.29%     663.062us     221.021us       0.000us         0.00%      23.808us       7.936us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.27%      24.831us        73.88%     561.180us     187.060us      17.792us       100.00%      23.808us       7.936us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.792us       100.00%      17.792us       5.931us             3  
                                Activity Buffer Request        49.89%     378.947us        49.89%     378.947us     378.947us       6.016us        33.81%       6.016us       6.016us             1  
                                       aten::empty_like         1.06%       8.020us         3.96%      30.080us      10.027us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         2.90%      22.060us         2.90%      22.060us       7.353us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        20.72%     157.402us        20.72%     157.402us      52.467us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.71%       5.381us         0.71%       5.381us       5.381us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 759.624us
Self CUDA time total: 17.792us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.351us       671.15%     124.351us     124.351us             1  
                               hf_kernels_causal_conv1d        19.13%      92.321us        98.80%     476.748us     476.748us       0.000us         0.00%      24.736us      24.736us             1  
                                         CausalConv1dFn        14.83%      71.551us        79.67%     384.427us     128.142us       0.000us         0.00%      24.736us       8.245us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.89%      28.409us        58.58%     282.676us      94.225us      18.528us       100.00%      24.736us       8.245us             3  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.528us       100.00%      18.528us       6.176us             3  
                                Activity Buffer Request        20.26%      97.782us        20.26%      97.782us      97.782us       6.208us        33.51%       6.208us       6.208us             1  
                                       aten::empty_like         1.73%       8.360us         6.26%      30.200us      10.067us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.53%      21.840us         4.53%      21.840us       7.280us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        32.43%     156.485us        32.43%     156.485us      52.162us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.20%       5.770us         1.20%       5.770us       5.770us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 482.518us
Self CUDA time total: 18.528us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W2
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d         5.47%     101.271us        99.69%       1.845ms       1.845ms       0.000us         0.00%     162.913us     162.913us             1  
                                         CausalConv1dFn         4.05%      75.021us        94.22%       1.743ms     581.104us       0.000us         0.00%     162.913us      54.304us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.32%      24.372us        88.46%       1.637ms     545.603us      97.697us       100.00%     162.913us      54.304us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     139.807us       143.10%     139.807us     139.807us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      97.697us       100.00%      97.697us      32.566us             3  
                                Activity Buffer Request        78.43%       1.451ms        78.43%       1.451ms       1.451ms      65.216us        66.75%      65.216us      65.216us             1  
                                       aten::empty_like         0.45%       8.320us         1.70%      31.480us      10.493us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         1.25%      23.160us         1.25%      23.160us       7.720us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.71%     161.192us         8.71%     161.192us      53.731us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.721us         0.31%       5.721us       5.721us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.850ms
Self CUDA time total: 97.697us



======================================================================
PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                               hf_kernels_causal_conv1d        19.60%      95.701us        98.90%     482.848us     482.848us       0.000us         0.00%     163.744us     163.744us             1  
                                         CausalConv1dFn        15.21%      74.281us        79.29%     387.147us     129.049us       0.000us         0.00%     163.744us      54.581us             3  
              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.67%      27.701us        57.93%     282.846us      94.282us      98.688us       100.00%     163.744us      54.581us             3  
                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     139.968us       141.83%     139.968us     139.968us             1  
void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      98.688us       100.00%      98.688us      32.896us             3  
                                Activity Buffer Request        19.94%      97.362us        19.94%      97.362us      97.362us      65.056us        65.92%      65.056us      65.056us             1  
                                       aten::empty_like         1.68%       8.190us         6.15%      30.020us      10.007us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         4.47%      21.830us         4.47%      21.830us       7.277us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        32.32%     157.783us        32.32%     157.783us      52.594us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.10%       5.391us         1.10%       5.391us       5.391us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 488.239us
Self CUDA time total: 98.688us
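
Note on reading the traces: the `hf_kernels_causal_conv1d` row that carries a non-zero Self CUDA value reports more than 100% (for example 686.13% for `cuda_B4_D2048_S512_W2`). The figures are consistent with each percentage being computed against `Self CUDA time total`, which here equals the kernel time alone. A plausible reading, stated as an assumption rather than confirmed by the output, is that this row is the labelled range as seen on the GPU timeline, so it also spans the gaps between the three kernel launches. The sketch below only redoes the arithmetic on values copied from the tables above; the dictionary keys and the helper name `self_cuda_pct` are illustrative.

```python
# Values copied from the profiler tables above, in microseconds.
# "range_us" is the Self CUDA value of the hf_kernels_causal_conv1d row,
# "kernel_total_us" is the "Self CUDA time total" footer of the same trace.
traces = {
    "cuda_B4_D2048_S512_W2": {"range_us": 122.077, "kernel_total_us": 17.792, "reported_pct": 686.13},
    "cuda_B4_D2048_S512_W4": {"range_us": 124.351, "kernel_total_us": 18.528, "reported_pct": 671.15},
    "cuda_B4_D2048_S2048_W2": {"range_us": 139.807, "kernel_total_us": 97.697, "reported_pct": 143.10},
    "cuda_B4_D2048_S2048_W4": {"range_us": 139.968, "kernel_total_us": 98.688, "reported_pct": 141.83},
}


def self_cuda_pct(self_cuda_us: float, total_us: float) -> float:
    """Percentage of `total_us` represented by `self_cuda_us`, rounded like the table."""
    return round(100.0 * self_cuda_us / total_us, 2)


for name, t in traces.items():
    pct = self_cuda_pct(t["range_us"], t["kernel_total_us"])
    # Compare the recomputed percentage with the one printed by the profiler.
    print(f"{name}: recomputed {pct:.2f}% vs reported {t['reported_pct']:.2f}%")
```

The same division reproduces the `Activity Buffer Request` percentages, so values above 100% in this column do not by themselves indicate a measurement error.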


impl                      wl                      p50(ms)  ok
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B2_D64_S512_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W2     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W4     0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W2      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W4      0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S128_W4        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W2       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W4       0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W2        0.05  True
hf_kernels_causal_conv1d  cuda_B4_D64_S512_W4        0.05  True
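
The summary rows are ordered lexicographically by workload name, so `S2048` sorts before `S512`. A minimal sketch for ordering them numerically follows. It assumes the name encodes device, batch size (`B`), channel dimension (`D`), sequence length (`S`), and convolution width (`W`); that reading of the letters, and the helper `parse_workload`, are assumptions for illustration and not part of the benchmark code.

```python
import re

# Assumed layout of a workload name: <device>_B<batch>_D<dim>_S<seqlen>_W<width>.
WORKLOAD_RE = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<batch>\d+)_D(?P<dim>\d+)_S(?P<seqlen>\d+)_W(?P<width>\d+)$"
)


def parse_workload(name: str) -> dict:
    """Split a workload name into its device string and integer fields."""
    match = WORKLOAD_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised workload name: {name!r}")
    fields = match.groupdict()
    parsed = {"device": fields["device"]}
    for key in ("batch", "dim", "seqlen", "width"):
        parsed[key] = int(fields[key])
    return parsed


def numeric_key(name: str) -> tuple:
    """Sort key that compares the numeric fields as integers, not as text."""
    w = parse_workload(name)
    return (w["batch"], w["dim"], w["seqlen"], w["width"])


names = [
    "cuda_B2_D64_S128_W2",
    "cuda_B2_D64_S2048_W2",
    "cuda_B2_D64_S512_W2",
]
for name in sorted(names, key=numeric_key):
    print(name, parse_workload(name))
```

Sorting this way places `S512` before `S2048`, which makes it easier to scan how timings change with sequence length.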
▶ UV Install Logs
Fetching 11 files: 100%|██████████| 11/11 [00:01<00:00, 6.78it/s]

Artifacts:

causal_conv1d.jsonl