HF Kernels - Rotary Position Embeddings

GPU Info

Cell: nv | 0.21s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Thu Oct 30 15:52:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   30C    P0             76W /  350W |       0MiB /  46068MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Rotary Embeddings Benchmark

Cell: benchmark | 8.39s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the rotary kernel from the Hugging Face Hub
rotary = get_kernel("kernels-community/rotary")


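def hf_kernels_rotary(query, key, cos, sin, conj=False):
    """Rotate the first 2 * rotary_dim features of query and key; return new tensors."""
    # The last dimension of cos (and sin) gives the size of each rotated half
    rotary_dim = cos.shape[-1]

    # Clone so the in-place kernel calls below do not modify the caller's inputs
    q_out = query.clone()
    k_out = key.clone()

    # Apply rotation to query: q1 and q2 are views into q_out and are passed
    # as both inputs and outputs, so the result is written directly into q_out
    q1 = q_out[..., :rotary_dim]
    q2 = q_out[..., rotary_dim : 2 * rotary_dim]
    rotary.apply_rotary(q1, q2, cos, sin, q1, q2, conj)

    # Apply rotation to key in the same way, writing into k_out
    k1 = k_out[..., :rotary_dim]
    k2 = k_out[..., rotary_dim : 2 * rotary_dim]
    rotary.apply_rotary(k1, k2, cos, sin, k1, k2, conj)

    return q_out, k_out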


run_benchmark(
    kernel_type=KernelTypeEnum.ROTARY,
    impl_name="hf_kernels_rotary",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_rotary,
    dtype="float32",
)
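
For comparison, the rotation that hf_kernels_rotary delegates to the kernel can be sketched in plain PyTorch. This is a minimal sketch, not the kernel's source: it assumes apply_rotary computes out1 = x1*cos - x2*sin and out2 = x1*sin + x2*cos, that conj=True negates sin, and that cos and sin broadcast against the sliced halves. The name reference_rotary is hypothetical and does not appear in the benchmark tools.

```python
import torch


def reference_rotary(query, key, cos, sin, conj=False):
    """Pure-PyTorch sketch of the rotation (assumed semantics, see lead-in)."""
    rotary_dim = cos.shape[-1]
    # Assumption: conj=True applies the inverse rotation by flipping sin's sign
    sin = -sin if conj else sin

    def rotate(x):
        out = x.clone()  # features beyond 2 * rotary_dim pass through unchanged
        x1 = x[..., :rotary_dim]
        x2 = x[..., rotary_dim : 2 * rotary_dim]
        out[..., :rotary_dim] = x1 * cos - x2 * sin
        out[..., rotary_dim : 2 * rotary_dim] = x1 * sin + x2 * cos
        return out

    return rotate(query), rotate(key)
```

Under these assumptions, applying the function with conj=False and then with conj=True recovers the original tensors, which gives a cheap sanity check when comparing against a kernel's output.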
Running rotary benchmark on cuda with 24 workloads.

======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S128_H8_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     437.951us      1890.33%     437.951us     437.951us             1  
                                      hf_kernels_rotary        12.22%     256.435us        99.67%       2.092ms       2.092ms       0.000us         0.00%      24.448us      24.448us             1  
                          _rotary_dba7d1e::apply_rotary         2.70%      56.773us         5.22%     109.533us      18.255us      16.128us        69.61%      16.128us       2.688us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      16.128us        69.61%      16.128us       2.688us             6  
                                            aten::clone         2.06%      43.312us        79.20%       1.663ms     277.110us       0.000us         0.00%       8.320us       1.387us             6  
                                            aten::copy_         2.16%      45.349us        74.16%       1.557ms     259.469us       7.040us        30.39%       8.320us       1.387us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.040us        30.39%       7.040us       1.173us             6  
                                Activity Buffer Request        68.35%       1.435ms        68.35%       1.435ms       1.435ms       1.280us         5.52%       1.280us       1.280us             1  
                                    aten::empty_strided         2.98%      62.532us         2.98%      62.532us      10.422us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync         3.65%      76.672us         3.65%      76.672us      12.779us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         2.33%      48.990us         3.04%      63.719us       5.310us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.70%      14.729us         0.70%      14.729us       1.227us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.51%      52.760us         2.51%      52.760us       8.793us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.33%       6.840us         0.33%       6.840us       6.840us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.099ms
Self CUDA time total: 23.168us
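
The percentages in the trace above can be reproduced from its own numbers. The short check below uses only values copied from the cuda_B1_S128_H8_D64_R32 table; the variable names are illustrative. It suggests that "Self CUDA time total" is the sum of the rotary kernel time and the device-to-device copies, and that the first hf_kernels_rotary row (which appears to be the annotated range on the GPU timeline rather than a kernel) is measured against that same total, which is why its Self CUDA % exceeds 100%.

```python
# Values copied from the cuda_B1_S128_H8_D64_R32 trace (microseconds)
rotary_kernel_us = 16.128      # _rotary_dba7d1e::apply_rotary, Self CUDA
memcpy_dtod_us = 7.040         # Memcpy DtoD, Self CUDA
annotation_range_us = 437.951  # first hf_kernels_rotary row, Self CUDA

self_cuda_total_us = rotary_kernel_us + memcpy_dtod_us
rotary_pct = 100 * rotary_kernel_us / self_cuda_total_us
memcpy_pct = 100 * memcpy_dtod_us / self_cuda_total_us
range_pct = 100 * annotation_range_us / self_cuda_total_us

print(f"Self CUDA time total: {self_cuda_total_us:.3f}us")
print(f"apply_rotary: {rotary_pct:.2f}%  Memcpy DtoD: {memcpy_pct:.2f}%")
print(f"hf_kernels_rotary range: {range_pct:.2f}%")
```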



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S128_H8_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     347.903us      1449.48%     347.903us     347.903us             1  
                                      hf_kernels_rotary         8.54%     161.773us        99.74%       1.890ms       1.890ms       0.000us         0.00%      25.314us      25.314us             1  
                          _rotary_dba7d1e::apply_rotary         2.18%      41.260us         4.61%      87.431us      14.572us      16.194us        67.47%      16.194us       2.699us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      16.194us        67.47%      16.194us       2.699us             6  
                                            aten::clone         1.21%      22.941us        84.30%       1.597ms     266.206us       0.000us         0.00%       9.120us       1.520us             6  
                                            aten::copy_         2.05%      38.809us        81.33%       1.541ms     256.844us       7.808us        32.53%       9.120us       1.520us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.808us        32.53%       7.808us       1.301us             6  
                                Activity Buffer Request        76.43%       1.448ms        76.43%       1.448ms       1.448ms       1.312us         5.47%       1.312us       1.312us             1  
                                    aten::empty_strided         1.75%      33.230us         1.75%      33.230us       5.538us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync         2.85%      54.092us         2.85%      54.092us       9.015us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.79%      33.972us         2.29%      43.382us       3.615us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.50%       9.410us         0.50%       9.410us       0.784us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.44%      46.171us         2.44%      46.171us       7.695us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.26%       4.990us         0.26%       4.990us       4.990us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.895ms
Self CUDA time total: 24.002us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S128_H32_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     344.799us      1421.56%     344.799us     344.799us             1  
                                      hf_kernels_rotary         8.36%     157.652us        99.72%       1.880ms       1.880ms       0.000us         0.00%      25.535us      25.535us             1  
                          _rotary_dba7d1e::apply_rotary         2.20%      41.393us         4.58%      86.433us      14.405us      16.479us        67.94%      16.479us       2.747us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      16.479us        67.94%      16.479us       2.747us             6  
                                            aten::clone         1.19%      22.449us        84.54%       1.594ms     265.688us       0.000us         0.00%       9.056us       1.509us             6  
                                            aten::copy_         1.98%      37.391us        81.51%       1.537ms     256.168us       7.776us        32.06%       9.056us       1.509us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.776us        32.06%       7.776us       1.296us             6  
                                Activity Buffer Request        76.55%       1.443ms        76.55%       1.443ms       1.443ms       1.280us         5.28%       1.280us       1.280us             1  
                                    aten::empty_strided         1.84%      34.673us         1.84%      34.673us       5.779us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync         2.98%      56.200us         2.98%      56.200us       9.367us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.75%      32.991us         2.23%      42.120us       3.510us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.48%       9.129us         0.48%       9.129us       0.761us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.39%      45.040us         2.39%      45.040us       7.507us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.28%       5.250us         0.28%       5.250us       5.250us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.886ms
Self CUDA time total: 24.255us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S128_H32_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     344.221us      1225.16%     344.221us     344.221us             1  
                                      hf_kernels_rotary         7.87%     162.633us        99.75%       2.060ms       2.060ms       0.000us         0.00%      29.824us      29.824us             1  
                          _rotary_dba7d1e::apply_rotary         1.96%      40.432us         4.15%      85.752us      14.292us      17.728us        63.10%      17.728us       2.955us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      17.728us        63.10%      17.728us       2.955us             6  
                                            aten::clone         1.05%      21.772us        85.59%       1.768ms     294.674us       0.000us         0.00%      12.096us       2.016us             6  
                                            aten::copy_         1.75%      36.131us        82.94%       1.713ms     285.533us      10.368us        36.90%      12.096us       2.016us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.368us        36.90%      10.368us       1.728us             6  
                                Activity Buffer Request        69.12%       1.428ms        69.12%       1.428ms       1.428ms       1.728us         6.15%       1.728us       1.728us             1  
                                    aten::empty_strided         1.60%      33.071us         1.60%      33.071us       5.512us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        12.07%     249.233us        12.07%     249.233us      41.539us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.63%      33.600us         2.13%      43.960us       3.663us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.50%      10.360us         0.50%      10.360us       0.863us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.19%      45.320us         2.19%      45.320us       7.553us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.25%       5.220us         0.25%       5.220us       5.220us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.066ms
Self CUDA time total: 28.096us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S512_H8_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     345.758us      1419.83%     345.758us     345.758us             1  
                                      hf_kernels_rotary         7.72%     159.843us        99.76%       2.064ms       2.064ms       0.000us         0.00%      25.664us      25.664us             1  
                          _rotary_dba7d1e::apply_rotary         1.98%      40.892us         4.09%      84.633us      14.106us      16.544us        67.94%      16.544us       2.757us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      16.544us        67.94%      16.544us       2.757us             6  
                                            aten::clone         1.14%      23.531us        85.80%       1.775ms     295.882us       0.000us         0.00%       9.120us       1.520us             6  
                                            aten::copy_         1.76%      36.431us        83.03%       1.718ms     286.337us       7.808us        32.06%       9.120us       1.520us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.808us        32.06%       7.808us       1.301us             6  
                                Activity Buffer Request        69.77%       1.444ms        69.77%       1.444ms       1.444ms       1.312us         5.39%       1.312us       1.312us             1  
                                    aten::empty_strided         1.63%      33.740us         1.63%      33.740us       5.623us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        11.50%     237.923us        11.50%     237.923us      39.654us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.68%      34.750us         2.15%      44.540us       3.712us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.47%       9.790us         0.47%       9.790us       0.816us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.11%      43.741us         2.11%      43.741us       7.290us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.24%       4.890us         0.24%       4.890us       4.890us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.069ms
Self CUDA time total: 24.352us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S512_H8_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     375.259us      1340.31%     375.259us     375.259us             1  
                                      hf_kernels_rotary         7.92%     165.422us        99.76%       2.085ms       2.085ms       0.000us         0.00%      29.790us      29.790us             1  
                          _rotary_dba7d1e::apply_rotary         2.01%      42.019us         4.24%      88.630us      14.772us      17.566us        62.74%      17.566us       2.928us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      17.566us        62.74%      17.566us       2.928us             6  
                                            aten::clone         1.13%      23.560us        85.51%       1.787ms     297.810us       0.000us         0.00%      12.224us       2.037us             6  
                                            aten::copy_         1.86%      38.872us        82.84%       1.731ms     288.508us      10.432us        37.26%      12.224us       2.037us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.432us        37.26%      10.432us       1.739us             6  
                                Activity Buffer Request        68.75%       1.437ms        68.75%       1.437ms       1.437ms       1.792us         6.40%       1.792us       1.792us             1  
                                    aten::empty_strided         1.54%      32.252us         1.54%      32.252us       5.375us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        12.23%     255.474us        12.23%     255.474us      42.579us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.66%      34.672us         2.10%      43.902us       3.658us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.44%       9.230us         0.44%       9.230us       0.769us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.23%      46.611us         2.23%      46.611us       7.769us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.24%       4.930us         0.24%       4.930us       4.930us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.090ms
Self CUDA time total: 27.998us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S512_H32_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     346.557us       858.83%     346.557us     346.557us             1  
                                      hf_kernels_rotary         7.80%     160.642us        99.76%       2.055ms       2.055ms       0.000us         0.00%      43.200us      43.200us             1  
                          _rotary_dba7d1e::apply_rotary         2.00%      41.122us         4.23%      87.123us      14.521us      23.424us        58.05%      23.424us       3.904us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      23.424us        58.05%      23.424us       3.904us             6  
                                            aten::clone         1.11%      22.900us        85.69%       1.765ms     294.130us       0.000us         0.00%      19.776us       3.296us             6  
                                            aten::copy_         1.80%      37.091us        82.95%       1.708ms     284.737us      16.928us        41.95%      19.776us       3.296us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      16.928us        41.95%      16.928us       2.821us             6  
                                Activity Buffer Request        70.02%       1.442ms        70.02%       1.442ms       1.442ms       2.848us         7.06%       2.848us       2.848us             1  
                                    aten::empty_strided         1.62%      33.460us         1.62%      33.460us       5.577us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        11.13%     229.194us        11.13%     229.194us      38.199us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.60%      33.049us         2.04%      42.051us       3.504us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.44%       9.002us         0.44%       9.002us       0.750us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.23%      46.001us         2.23%      46.001us       7.667us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.24%       4.950us         0.24%       4.950us       4.950us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.060ms
Self CUDA time total: 40.352us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S512_H32_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     349.374us       446.91%     349.374us     349.374us             1  
                                      hf_kernels_rotary         8.00%     163.391us        99.76%       2.039ms       2.039ms       0.000us         0.00%      90.720us      90.720us             1  
                                            aten::clone         1.09%      22.181us        85.39%       1.745ms     290.833us       0.000us         0.00%      52.224us       8.704us             6  
                                            aten::copy_         1.85%      37.761us        82.69%       1.690ms     281.650us      39.680us        50.76%      52.224us       8.704us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      39.680us        50.76%      39.680us       6.613us             6  
                          _rotary_dba7d1e::apply_rotary         2.10%      42.834us         4.25%      86.883us      14.481us      38.496us        49.24%      38.496us       6.416us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      38.496us        49.24%      38.496us       6.416us             6  
                                Activity Buffer Request        69.78%       1.426ms        69.78%       1.426ms       1.426ms      12.544us        16.05%      12.544us      12.544us             1  
                                    aten::empty_strided         1.61%      32.920us         1.61%      32.920us       5.487us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        11.06%     226.094us        11.06%     226.094us      37.682us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.62%      33.171us         2.12%      43.331us       3.611us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.50%      10.160us         0.50%      10.160us       0.847us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.16%      44.049us         2.16%      44.049us       7.341us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.24%       4.951us         0.24%       4.951us       4.951us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.044ms
Self CUDA time total: 78.176us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S2048_H8_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     333.879us       824.19%     333.879us     333.879us             1  
                                      hf_kernels_rotary        18.73%     154.483us        99.41%     820.134us     820.134us       0.000us         0.00%      43.327us      43.327us             1  
                          _rotary_dba7d1e::apply_rotary         4.89%      40.361us        10.02%      82.702us      13.784us      23.422us        57.82%      23.422us       3.904us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      23.422us        57.82%      23.422us       3.904us             6  
                                            aten::clone         2.46%      20.259us        65.56%     540.868us      90.145us       0.000us         0.00%      19.905us       3.317us             6  
                                            aten::copy_         4.70%      38.811us        59.16%     488.099us      81.350us      17.088us        42.18%      19.905us       3.317us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      17.088us        42.18%      17.088us       2.848us             6  
                                Activity Buffer Request        27.39%     225.944us        27.39%     225.944us     225.944us       2.817us         6.95%       2.817us       2.817us             1  
                                    aten::empty_strided         3.94%      32.510us         3.94%      32.510us       5.418us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        27.07%     223.344us        27.07%     223.344us      37.224us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.93%      32.394us         5.10%      42.081us       3.507us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.17%       9.687us         1.17%       9.687us       0.807us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.13%      42.341us         5.13%      42.341us       7.057us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.59%       4.860us         0.59%       4.860us       4.860us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 824.994us
Self CUDA time total: 40.510us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S2048_H8_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     338.778us       450.33%     338.778us     338.778us             1  
                                      hf_kernels_rotary        18.40%     151.937us        99.39%     820.824us     820.824us       0.000us         0.00%      85.723us      85.723us             1  
                                            aten::clone         2.47%      20.430us        65.45%     540.538us      90.090us       0.000us         0.00%      47.293us       7.882us             6  
                                            aten::copy_         4.41%      36.400us        59.08%     487.928us      81.321us      36.798us        48.92%      47.293us       7.882us             6  
                          _rotary_dba7d1e::apply_rotary         4.89%      40.390us        10.51%      86.760us      14.460us      38.430us        51.08%      38.430us       6.405us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      38.430us        51.08%      38.430us       6.405us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      36.798us        48.92%      36.798us       6.133us             6  
                                Activity Buffer Request        27.74%     229.134us        27.74%     229.134us     229.134us      10.495us        13.95%      10.495us      10.495us             1  
                                    aten::empty_strided         3.90%      32.180us         3.90%      32.180us       5.363us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        26.93%     222.394us        26.93%     222.394us      37.066us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.90%      32.180us         5.04%      41.589us       3.466us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.14%       9.409us         1.14%       9.409us       0.784us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.61%      46.370us         5.61%      46.370us       7.728us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.61%       5.040us         0.61%       5.040us       5.040us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 825.864us
Self CUDA time total: 75.228us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S2048_H32_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     338.815us       244.98%     338.815us     338.815us             1  
                                      hf_kernels_rotary        17.96%     152.299us        99.45%     843.474us     843.474us       0.000us         0.00%     161.823us     161.823us             1  
                                            aten::clone         2.40%      20.339us        66.32%     562.460us      93.743us       0.000us         0.00%     102.176us      17.029us             6  
                                            aten::copy_         4.27%      36.251us        60.21%     510.629us      85.105us      78.656us        56.87%     102.176us      17.029us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      78.656us        56.87%      78.656us      13.109us             6  
                          _rotary_dba7d1e::apply_rotary         4.86%      41.202us        10.23%      86.763us      14.460us      59.647us        43.13%      59.647us       9.941us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      59.647us        43.13%      59.647us       9.941us             6  
                                Activity Buffer Request        30.37%     257.584us        30.37%     257.584us     257.584us      23.520us        17.01%      23.520us      23.520us             1  
                                    aten::empty_strided         3.71%      31.492us         3.71%      31.492us       5.249us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        25.56%     216.794us        25.56%     216.794us      36.132us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.89%      32.951us         4.95%      41.952us       3.496us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.06%       9.001us         1.06%       9.001us       0.750us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.37%      45.561us         5.37%      45.561us       7.594us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.55%       4.640us         0.55%       4.640us       4.640us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 848.114us
Self CUDA time total: 138.303us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B1_S2048_H32_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary        12.84%     152.812us        71.89%     855.575us     855.575us       0.000us         0.00%     769.625us     769.625us             1  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     710.234us       101.16%     710.234us     710.234us             1  
                                            aten::clone         1.76%      21.001us        48.07%     572.021us      95.337us       0.000us         0.00%     572.987us      95.498us             6  
                                            aten::copy_         3.15%      37.471us        43.65%     519.450us      86.575us     505.436us        71.99%     572.987us      95.498us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     505.436us        71.99%     505.436us      84.239us             6  
                          _rotary_dba7d1e::apply_rotary         3.42%      40.722us         7.33%      87.262us      14.544us     196.638us        28.01%     196.638us      32.773us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us     196.638us        28.01%     196.638us      32.773us             6  
                                Activity Buffer Request        21.90%     260.665us        21.90%     260.665us     260.665us      67.551us         9.62%      67.551us      67.551us             1  
                                    aten::empty_strided         2.65%      31.570us         2.65%      31.570us       5.262us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        18.60%     221.314us        18.60%     221.314us      36.886us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         2.82%      33.601us         3.65%      43.480us       3.623us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.83%       9.879us         0.83%       9.879us       0.823us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         3.91%      46.540us         3.91%      46.540us       7.757us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        28.11%     334.485us        28.11%     334.485us     334.485us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.190ms
Self CUDA time total: 702.074us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S128_H8_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     340.957us      1280.69%     340.957us     340.957us             1  
                                      hf_kernels_rotary        17.85%     154.192us        99.45%     858.915us     858.915us       0.000us         0.00%      27.935us      27.935us             1  
                          _rotary_dba7d1e::apply_rotary         4.82%      41.593us        10.09%      87.173us      14.529us      18.719us        70.31%      18.719us       3.120us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      18.719us        70.31%      18.719us       3.120us             6  
                                            aten::clone         2.51%      21.701us        66.67%     575.779us      95.963us       0.000us         0.00%       9.216us       1.536us             6  
                                            aten::copy_         4.05%      34.978us        60.54%     522.828us      87.138us       7.904us        29.69%       9.216us       1.536us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.904us        29.69%       7.904us       1.317us             6  
                                Activity Buffer Request        30.68%     265.004us        30.68%     265.004us     265.004us       1.312us         4.93%       1.312us       1.312us             1  
                                    aten::empty_strided         3.62%      31.250us         3.62%      31.250us       5.208us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        25.80%     222.846us        25.80%     222.846us      37.141us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.77%      32.522us         4.84%      41.771us       3.481us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.07%       9.249us         1.07%       9.249us       0.771us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.28%      45.580us         5.28%      45.580us       7.597us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.55%       4.760us         0.55%       4.760us       4.760us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 863.675us
Self CUDA time total: 26.623us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S128_H8_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     331.838us      1247.93%     331.838us     331.838us             1  
                                      hf_kernels_rotary        18.40%     149.763us        99.33%     808.424us     808.424us       0.000us         0.00%      27.871us      27.871us             1  
                          _rotary_dba7d1e::apply_rotary         5.12%      41.640us        10.68%      86.941us      14.490us      18.879us        71.00%      18.879us       3.147us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      18.879us        71.00%      18.879us       3.147us             6  
                                            aten::clone         2.56%      20.830us        65.24%     531.000us      88.500us       0.000us         0.00%       8.992us       1.499us             6  
                                            aten::copy_         4.49%      36.550us        58.98%     480.009us      80.001us       7.712us        29.00%       8.992us       1.499us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       7.712us        29.00%       7.712us       1.285us             6  
                                Activity Buffer Request        28.18%     229.375us        28.18%     229.375us     229.375us       1.280us         4.81%       1.280us       1.280us             1  
                                    aten::empty_strided         3.71%      30.161us         3.71%      30.161us       5.027us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        26.30%     214.084us        26.30%     214.084us      35.681us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.92%      31.890us         5.00%      40.720us       3.393us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.08%       8.830us         1.08%       8.830us       0.736us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.57%      45.301us         5.57%      45.301us       7.550us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.67%       5.440us         0.67%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 813.864us
Self CUDA time total: 26.591us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S128_H32_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     353.852us      1157.89%     353.852us     353.852us             1  
                                      hf_kernels_rotary         7.66%     156.034us        99.77%       2.033ms       2.033ms       0.000us         0.00%      32.320us      32.320us             1  
                          _rotary_dba7d1e::apply_rotary         2.04%      41.512us         4.26%      86.762us      14.460us      20.159us        65.97%      20.159us       3.360us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      20.159us        65.97%      20.159us       3.360us             6  
                                            aten::clone         1.10%      22.431us        85.66%       1.746ms     290.955us       0.000us         0.00%      12.161us       2.027us             6  
                                            aten::copy_         2.23%      45.431us        82.85%       1.688ms     281.408us      10.401us        34.03%      12.161us       2.027us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.401us        34.03%      10.401us       1.734us             6  
                                Activity Buffer Request        70.07%       1.428ms        70.07%       1.428ms       1.428ms       1.760us         5.76%       1.760us       1.760us             1  
                                    aten::empty_strided         1.71%      34.849us         1.71%      34.849us       5.808us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        10.54%     214.913us        10.54%     214.913us      35.819us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.68%      34.241us         2.20%      44.770us       3.731us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.52%      10.529us         0.52%      10.529us       0.877us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.22%      45.250us         2.22%      45.250us       7.542us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.23%       4.770us         0.23%       4.770us       4.770us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.038ms
Self CUDA time total: 30.560us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S128_H32_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     367.612us       860.51%     367.612us     367.612us             1  
                                      hf_kernels_rotary         7.69%     158.003us        99.76%       2.050ms       2.050ms       0.000us         0.00%      45.568us      45.568us             1  
                          _rotary_dba7d1e::apply_rotary         2.04%      41.961us         4.25%      87.391us      14.565us      25.759us        60.30%      25.759us       4.293us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      25.759us        60.30%      25.759us       4.293us             6  
                                            aten::clone         1.11%      22.799us        84.82%       1.743ms     290.528us       0.000us         0.00%      19.809us       3.301us             6  
                                            aten::copy_         1.88%      38.712us        82.12%       1.688ms     281.267us      16.961us        39.70%      19.809us       3.301us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      16.961us        39.70%      16.961us       2.827us             6  
                                Activity Buffer Request        69.69%       1.432ms        69.69%       1.432ms       1.432ms       2.848us         6.67%       2.848us       2.848us             1  
                                    aten::empty_strided         1.59%      32.771us         1.59%      32.771us       5.462us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        10.54%     216.613us        10.54%     216.613us      36.102us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         2.51%      51.572us         3.00%      61.672us       5.139us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.49%      10.100us         0.49%      10.100us       0.842us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.21%      45.430us         2.21%      45.430us       7.572us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.24%       4.849us         0.24%       4.849us       4.849us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.055ms
Self CUDA time total: 42.720us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S512_H8_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     347.614us      1135.14%     347.614us     347.614us             1  
                                      hf_kernels_rotary         7.64%     156.781us        99.77%       2.046ms       2.046ms       0.000us         0.00%      32.383us      32.383us             1  
                          _rotary_dba7d1e::apply_rotary         2.01%      41.122us         4.16%      85.392us      14.232us      20.223us        66.04%      20.223us       3.370us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      20.223us        66.04%      20.223us       3.370us             6  
                                            aten::clone         1.11%      22.841us        85.79%       1.759ms     293.223us       0.000us         0.00%      12.160us       2.027us             6  
                                            aten::copy_         1.81%      37.030us        83.06%       1.703ms     283.910us      10.400us        33.96%      12.160us       2.027us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.400us        33.96%      10.400us       1.733us             6  
                                Activity Buffer Request        70.68%       1.449ms        70.68%       1.449ms       1.449ms       1.760us         5.75%       1.760us       1.760us             1  
                                    aten::empty_strided         1.61%      33.040us         1.61%      33.040us       5.507us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        10.58%     216.984us        10.58%     216.984us      36.164us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.70%      34.784us         2.17%      44.532us       3.711us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.48%       9.748us         0.48%       9.748us       0.812us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         2.16%      44.270us         2.16%      44.270us       7.378us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.23%       4.760us         0.23%       4.760us       4.760us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.051ms
Self CUDA time total: 30.623us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S512_H8_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     328.444us       771.23%     328.444us     328.444us             1  
                                      hf_kernels_rotary        18.84%     150.934us        99.38%     796.084us     796.084us       0.000us         0.00%      45.403us      45.403us             1  
                          _rotary_dba7d1e::apply_rotary         5.06%      40.529us        10.59%      84.811us      14.135us      25.693us        60.33%      25.693us       4.282us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      25.693us        60.33%      25.693us       4.282us             6  
                                            aten::clone         2.49%      19.929us        64.90%     519.868us      86.645us       0.000us         0.00%      19.710us       3.285us             6  
                                            aten::copy_         4.41%      35.321us        58.57%     469.148us      78.191us      16.894us        39.67%      19.710us       3.285us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      16.894us        39.67%      16.894us       2.816us             6  
                                Activity Buffer Request        27.59%     221.013us        27.59%     221.013us     221.013us       2.816us         6.61%       2.816us       2.816us             1  
                                    aten::empty_strided         3.84%      30.791us         3.84%      30.791us       5.132us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        26.57%     212.814us        26.57%     212.814us      35.469us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.92%      31.361us         5.05%      40.471us       3.373us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.14%       9.110us         1.14%       9.110us       0.759us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.53%      44.282us         5.53%      44.282us       7.380us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.62%       4.951us         0.62%       4.951us       4.951us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 801.035us
Self CUDA time total: 42.587us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S512_H32_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     338.910us       380.70%     338.910us     338.910us             1  
                                      hf_kernels_rotary        14.14%     150.935us        99.54%       1.062ms       1.062ms       0.000us         0.00%     104.734us     104.734us             1  
                                            aten::clone         2.00%      21.371us        73.53%     784.703us     130.784us       0.000us         0.00%      63.775us      10.629us             6  
                                            aten::copy_         3.58%      38.219us        68.59%     731.952us     121.992us      48.063us        53.99%      63.775us      10.629us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      48.063us        53.99%      48.063us       8.010us             6  
                          _rotary_dba7d1e::apply_rotary         3.85%      41.059us         8.05%      85.950us      14.325us      40.959us        46.01%      40.959us       6.826us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      40.959us        46.01%      40.959us       6.826us             6  
                                Activity Buffer Request        44.86%     478.699us        44.86%     478.699us     478.699us      15.712us        17.65%      15.712us      15.712us             1  
                                    aten::empty_strided         2.94%      31.380us         2.94%      31.380us       5.230us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        20.15%     215.034us        20.15%     215.034us      35.839us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         2.96%      31.591us         3.81%      40.690us       3.391us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.85%       9.099us         0.85%       9.099us       0.758us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         4.21%      44.891us         4.21%      44.891us       7.482us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.46%       4.900us         0.46%       4.900us       4.900us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.067ms
Self CUDA time total: 89.022us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S512_H32_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     336.057us       230.21%     336.057us     336.057us             1  
                                      hf_kernels_rotary        18.72%     149.775us        99.40%     795.224us     795.224us       0.000us         0.00%     169.949us     169.949us             1  
                                            aten::clone         2.52%      20.180us        65.04%     520.348us      86.725us       0.000us         0.00%     106.527us      17.755us             6  
                                            aten::copy_         4.49%      35.890us        58.61%     468.868us      78.145us      82.559us        56.55%     106.527us      17.755us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      82.559us        56.55%      82.559us      13.760us             6  
                          _rotary_dba7d1e::apply_rotary         5.12%      40.981us        10.49%      83.942us      13.990us      63.422us        43.45%      63.422us      10.570us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      63.422us        43.45%      63.422us      10.570us             6  
                                Activity Buffer Request        27.82%     222.544us        27.82%     222.544us     222.544us      23.968us        16.42%      23.968us      23.968us             1  
                                    aten::empty_strided         3.91%      31.300us         3.91%      31.300us       5.217us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        26.30%     210.434us        26.30%     210.434us      35.072us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.94%      31.518us         5.14%      41.159us       3.430us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.21%       9.641us         1.21%       9.641us       0.803us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.37%      42.961us         5.37%      42.961us       7.160us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.60%       4.790us         0.60%       4.790us       4.790us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 800.014us
Self CUDA time total: 145.981us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S2048_H8_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     339.836us       451.90%     339.836us     339.836us             1  
                                      hf_kernels_rotary        18.57%     150.269us        99.38%     804.154us     804.154us       0.000us         0.00%      81.986us      81.986us             1  
                          _rotary_dba7d1e::apply_rotary         4.99%      40.401us        10.49%      84.862us      14.144us      41.601us        55.32%      41.601us       6.933us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      41.601us        55.32%      41.601us       6.933us             6  
                                            aten::clone         2.54%      20.532us        64.81%     524.439us      87.406us       0.000us         0.00%      40.385us       6.731us             6  
                                            aten::copy_         4.41%      35.708us        58.24%     471.217us      78.536us      33.601us        44.68%      40.385us       6.731us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      33.601us        44.68%      33.601us       5.600us             6  
                                Activity Buffer Request        27.71%     224.174us        27.71%     224.174us     224.174us       6.784us         9.02%       6.784us       6.784us             1  
                                    aten::empty_strided         4.04%      32.690us         4.04%      32.690us       5.448us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        26.12%     211.335us        26.12%     211.335us      35.223us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         4.32%      34.924us         5.51%      44.584us       3.715us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.19%       9.660us         1.19%       9.660us       0.805us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         5.49%      44.461us         5.49%      44.461us       7.410us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.62%       4.981us         0.62%       4.981us       4.981us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 809.135us
Self CUDA time total: 75.202us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S2048_H8_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     372.859us       256.14%     372.859us     372.859us             1  
                                      hf_kernels_rotary        18.64%     161.451us        99.43%     861.125us     861.125us       0.000us         0.00%     169.279us     169.279us             1  
                                            aten::clone         2.47%      21.401us        63.58%     550.631us      91.772us       0.000us         0.00%     105.373us      17.562us             6  
                                            aten::copy_         4.30%      37.239us        57.31%     496.359us      82.727us      81.662us        56.10%     105.373us      17.562us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      81.662us        56.10%      81.662us      13.610us             6  
                          _rotary_dba7d1e::apply_rotary         5.12%      44.341us        12.24%     106.023us      17.671us      63.906us        43.90%      63.906us      10.651us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us      63.906us        43.90%      63.906us      10.651us             6  
                                Activity Buffer Request        28.62%     247.854us        28.62%     247.854us     247.854us      23.711us        16.29%      23.711us      23.711us             1  
                                    aten::empty_strided         3.80%      32.871us         3.80%      32.871us       5.479us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        24.39%     211.266us        24.39%     211.266us      35.211us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         3.88%      33.609us         4.97%      43.020us       3.585us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         1.09%       9.411us         1.09%       9.411us       0.784us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         7.12%      61.682us         7.12%      61.682us      10.280us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize         0.57%       4.969us         0.57%       4.969us       4.969us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 866.094us
Self CUDA time total: 145.568us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S2048_H32_D64_R32
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary        13.02%     148.583us        72.32%     825.404us     825.404us       0.000us         0.00%     745.510us     745.510us             1  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us     687.015us       101.19%     687.015us     687.015us             1  
                                            aten::clone         1.76%      20.130us        47.96%     547.368us      91.228us       0.000us         0.00%     556.292us      92.715us             6  
                                            aten::copy_         3.18%      36.280us        43.26%     493.818us      82.303us     489.699us        72.13%     556.292us      92.715us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     489.699us        72.13%     489.699us      81.617us             6  
                          _rotary_dba7d1e::apply_rotary         3.57%      40.732us         7.58%      86.552us      14.425us     189.218us        27.87%     189.218us      31.536us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us     189.218us        27.87%     189.218us      31.536us             6  
                                Activity Buffer Request        21.89%     249.905us        21.89%     249.905us     249.905us      66.593us         9.81%      66.593us      66.593us             1  
                                    aten::empty_strided         2.93%      33.420us         2.93%      33.420us       5.570us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync        18.19%     207.633us        18.19%     207.633us      34.606us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         2.92%      33.351us         3.76%      42.901us       3.575us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.84%       9.550us         0.84%       9.550us       0.796us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         4.01%      45.820us         4.01%      45.820us       7.637us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        27.68%     315.986us        27.68%     315.986us     315.986us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.141ms
Self CUDA time total: 678.917us



======================================================================
PROFILE TRACE: hf_kernels_rotary | cuda_B2_S2048_H32_D128_R64
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_rotary         5.26%     153.062us        28.60%     832.074us     832.074us       0.000us         0.00%       2.627ms       2.627ms             1  
                                      hf_kernels_rotary         0.00%       0.000us         0.00%       0.000us       0.000us       2.451ms       100.32%       2.451ms       2.451ms             1  
                                            aten::clone         0.71%      20.751us        18.92%     550.432us      91.739us       0.000us         0.00%       1.403ms     233.752us             6  
                                            aten::copy_         1.33%      38.628us        17.10%     497.389us      82.898us       1.219ms        49.87%       1.403ms     233.752us             6  
                          _rotary_dba7d1e::apply_rotary         1.42%      41.449us         2.89%      84.050us      14.008us       1.225ms        50.13%       1.225ms     204.141us             6  
void at::native::(anonymous namespace)::unrolled_ele...         0.00%       0.000us         0.00%       0.000us       0.000us       1.225ms        50.13%       1.225ms     204.141us             6  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.219ms        49.87%       1.219ms     203.112us             6  
                                Activity Buffer Request         8.62%     250.725us         8.62%     250.725us     250.725us     183.838us         7.52%     183.838us     183.838us             1  
                                    aten::empty_strided         1.11%      32.292us         1.11%      32.292us       5.382us       0.000us         0.00%       0.000us       0.000us             6  
                                        cudaMemcpyAsync         7.15%     208.036us         7.15%     208.036us      34.673us       0.000us         0.00%       0.000us       0.000us             6  
                                            aten::slice         1.18%      34.219us         1.53%      44.530us       3.711us       0.000us         0.00%       0.000us       0.000us            12  
                                       aten::as_strided         0.35%      10.311us         0.35%      10.311us       0.859us       0.000us         0.00%       0.000us       0.000us            12  
                                       cudaLaunchKernel         1.46%      42.601us         1.46%      42.601us       7.100us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        71.40%       2.077ms        71.40%       2.077ms       2.077ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.909ms
Self CUDA time total: 2.444ms


impl               wl                          p50(ms)  ok
hf_kernels_rotary  cuda_B1_S128_H32_D128_R64      0.09  True
hf_kernels_rotary  cuda_B1_S128_H32_D64_R32       0.09  True
hf_kernels_rotary  cuda_B1_S128_H8_D128_R64       0.09  True
hf_kernels_rotary  cuda_B1_S128_H8_D64_R32        0.08  True
hf_kernels_rotary  cuda_B1_S2048_H32_D128_R64     0.26  True
hf_kernels_rotary  cuda_B1_S2048_H32_D64_R32      0.09  True
hf_kernels_rotary  cuda_B1_S2048_H8_D128_R64      0.09  True
hf_kernels_rotary  cuda_B1_S2048_H8_D64_R32       0.09  True
hf_kernels_rotary  cuda_B1_S512_H32_D128_R64      0.09  True
hf_kernels_rotary  cuda_B1_S512_H32_D64_R32       0.09  True
hf_kernels_rotary  cuda_B1_S512_H8_D128_R64       0.09  True
hf_kernels_rotary  cuda_B1_S512_H8_D64_R32        0.09  True
hf_kernels_rotary  cuda_B2_S128_H32_D128_R64      0.09  True
hf_kernels_rotary  cuda_B2_S128_H32_D64_R32       0.09  True
hf_kernels_rotary  cuda_B2_S128_H8_D128_R64       0.09  True
hf_kernels_rotary  cuda_B2_S128_H8_D64_R32        0.09  True
hf_kernels_rotary  cuda_B2_S2048_H32_D128_R64     0.85  True
hf_kernels_rotary  cuda_B2_S2048_H32_D64_R32      0.26  True
hf_kernels_rotary  cuda_B2_S2048_H8_D128_R64      0.09  True
hf_kernels_rotary  cuda_B2_S2048_H8_D64_R32       0.09  True
hf_kernels_rotary  cuda_B2_S512_H32_D128_R64      0.09  True
hf_kernels_rotary  cuda_B2_S512_H32_D64_R32       0.09  True
hf_kernels_rotary  cuda_B2_S512_H8_D128_R64       0.09  True
hf_kernels_rotary  cuda_B2_S512_H8_D64_R32        0.09  True
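
The workload labels in the summary table pack the tensor shape into a single string. The following is a minimal sketch for splitting a label into numeric fields so the p50 values can be compared against problem size. The field meanings (B = batch, S = sequence length, H = heads, D = head dimension, R = rotary dimension) are an assumption based on the label pattern, and `parse_workload` and `element_count` are hypothetical helpers, not part of the benchmark code.

```python
import re
from typing import Dict, Union

# Pattern for labels such as "cuda_B2_S2048_H32_D128_R64".
# The field names are assumed; the report only shows the labels themselves.
_WORKLOAD_RE = re.compile(
    r"^(?P<device>[a-z]+)"
    r"_B(?P<batch>\d+)"
    r"_S(?P<seq_len>\d+)"
    r"_H(?P<heads>\d+)"
    r"_D(?P<head_dim>\d+)"
    r"_R(?P<rotary_dim>\d+)$"
)


def parse_workload(label: str) -> Dict[str, Union[str, int]]:
    """Split a workload label into a device string and integer shape fields."""
    match = _WORKLOAD_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized workload label: {label!r}")
    fields = match.groupdict()
    parsed: Dict[str, Union[str, int]] = {"device": fields.pop("device")}
    parsed.update({name: int(value) for name, value in fields.items()})
    return parsed


def element_count(workload: Dict[str, Union[str, int]]) -> int:
    """Elements in one (batch, seq_len, heads, head_dim) tensor, under the assumed layout."""
    return (
        int(workload["batch"])
        * int(workload["seq_len"])
        * int(workload["heads"])
        * int(workload["head_dim"])
    )


if __name__ == "__main__":
    for label in ("cuda_B2_S2048_H32_D128_R64", "cuda_B2_S512_H8_D64_R32"):
        workload = parse_workload(label)
        print(label, workload, element_count(workload))
```

Under this reading, the three rows with p50 above 0.09 ms (`cuda_B1_S2048_H32_D128_R64`, `cuda_B2_S2048_H32_D64_R32`, and `cuda_B2_S2048_H32_D128_R64`) are also the three with the largest element counts, which is consistent with the smaller workloads being dominated by fixed launch overhead.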

Artifacts:

rotary.jsonl