Rotary Position Embeddings Benchmarks - Aggregated Results

This document combines benchmark results from two Rotary Position Embeddings (RoPE) implementations, hf_kernels_rotary and torch_eager, measured on 24 CUDA workloads.

Combined Summary and Visualization

[Figure: latency.svg, generated 2025-10-30T15:53:49 with Matplotlib v3.10.7. Title: "Attention Implementation Latency". X axis: Workload (24 labels, cuda_B{1,2}_S{128,512,2048}_H{8,32}_D64_R32 and _D128_R64). Y axis: Latency P50 (ms), ticks 0.1 to 0.8. Series: hf_kernels_rotary, torch_eager.]
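The workload labels on the x axis follow a fixed pattern. A minimal sketch of splitting one label into its parts is given here. The field meanings (B = batch size, S = sequence length, H = number of heads, D = head dimension, R = rotary dimension) are an assumption read off the label shape, not something the report states, and `Workload` and `parse_workload` are hypothetical names introduced only for illustration.

```python
import re
from typing import NamedTuple


class Workload(NamedTuple):
    # Field meanings are assumed from the label shape, not confirmed by the report.
    device: str
    batch: int
    seq_len: int
    heads: int
    head_dim: int
    rotary_dim: int


# Matches labels such as "cuda_B1_S128_H8_D64_R32".
_LABEL = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<b>\d+)_S(?P<s>\d+)_H(?P<h>\d+)_D(?P<d>\d+)_R(?P<r>\d+)$"
)


def parse_workload(label: str) -> Workload:
    """Split a workload label into typed fields, or raise ValueError."""
    m = _LABEL.match(label)
    if m is None:
        raise ValueError(f"unrecognised workload label: {label!r}")
    return Workload(
        device=m["device"],
        batch=int(m["b"]),
        seq_len=int(m["s"]),
        heads=int(m["h"]),
        head_dim=int(m["d"]),
        rotary_dim=int(m["r"]),
    )
```

Parsed this way, labels can be sorted numerically (for example by `seq_len`) instead of lexicographically, which is why S2048 appears before S512 in the text summary.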
======================================================================
LOADING BENCHMARK DATA
======================================================================
✓ HF Kernels Rotary             : /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/3884170bda871392d403d55c822a8b7de8970f81c4733ae7630938c3bf0db88a
✓ PyTorch Rotary                : /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/abf801d6445dfa81a8dd7b2e6257930c39c18160a9b97a739858c3b244e16cc5

  ✓ Found HF Kernels Rotary
     Path: /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/3884170bda871392d403d55c822a8b7de8970f81c4733ae7630938c3bf0db88a/rotary.jsonl
  ✓ Found PyTorch Rotary
     Path: /__w/kernels-benchmarks/kernels-benchmarks/benches/rotary/impls/.uvnote/cache/abf801d6445dfa81a8dd7b2e6257930c39c18160a9b97a739858c3b244e16cc5/rotary.jsonl

======================================================================
Summary: 2 found, 0 skipped, 0 missing
======================================================================
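The loading step above locates one `rotary.jsonl` file per implementation and reports how many were found or missing. A rough sketch of that kind of loader follows. It is not the report's actual code: `parse_jsonl`, `load_records`, and `load_all` are hypothetical helpers, and nothing is assumed about the record schema beyond one JSON object per line.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def load_records(path):
    """Read every record from a single .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_jsonl(fh)


def load_all(paths):
    """Load records for each named implementation.

    `paths` maps a display name to a .jsonl path (hypothetical layout).
    Returns (found, missing): a dict of name -> records, and a list of
    names whose file does not exist.
    """
    found, missing = {}, []
    for name, path in paths.items():
        if Path(path).is_file():
            found[name] = load_records(path)
        else:
            missing.append(name)
    return found, missing
```

Returning the missing names separately, rather than raising on the first absent file, makes it possible to print a "found / missing" summary like the one shown in the output.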

COMBINED BENCHMARK SUMMARY

impl                     wl                          p50(ms)  ok
hf_kernels_rotary        cuda_B1_S128_H32_D128_R64      0.09  True
hf_kernels_rotary        cuda_B1_S128_H32_D64_R32       0.09  True
hf_kernels_rotary        cuda_B1_S128_H8_D128_R64       0.09  True
hf_kernels_rotary        cuda_B1_S128_H8_D64_R32        0.08  True
hf_kernels_rotary        cuda_B1_S2048_H32_D128_R64     0.26  True
hf_kernels_rotary        cuda_B1_S2048_H32_D64_R32      0.09  True
hf_kernels_rotary        cuda_B1_S2048_H8_D128_R64      0.09  True
hf_kernels_rotary        cuda_B1_S2048_H8_D64_R32       0.09  True
hf_kernels_rotary        cuda_B1_S512_H32_D128_R64      0.09  True
hf_kernels_rotary        cuda_B1_S512_H32_D64_R32       0.09  True
hf_kernels_rotary        cuda_B1_S512_H8_D128_R64       0.09  True
hf_kernels_rotary        cuda_B1_S512_H8_D64_R32        0.09  True
hf_kernels_rotary        cuda_B2_S128_H32_D128_R64      0.09  True
hf_kernels_rotary        cuda_B2_S128_H32_D64_R32       0.09  True
hf_kernels_rotary        cuda_B2_S128_H8_D128_R64       0.09  True
hf_kernels_rotary        cuda_B2_S128_H8_D64_R32        0.09  True
hf_kernels_rotary        cuda_B2_S2048_H32_D128_R64     0.85  True
hf_kernels_rotary        cuda_B2_S2048_H32_D64_R32      0.26  True
hf_kernels_rotary        cuda_B2_S2048_H8_D128_R64      0.09  True
hf_kernels_rotary        cuda_B2_S2048_H8_D64_R32       0.09  True
hf_kernels_rotary        cuda_B2_S512_H32_D128_R64      0.09  True
hf_kernels_rotary        cuda_B2_S512_H32_D64_R32       0.09  True
hf_kernels_rotary        cuda_B2_S512_H8_D128_R64       0.09  True
hf_kernels_rotary        cuda_B2_S512_H8_D64_R32        0.09  True
torch_eager              cuda_B1_S128_H32_D128_R64      0.22  True
torch_eager              cuda_B1_S128_H32_D64_R32       0.23  True
torch_eager              cuda_B1_S128_H8_D128_R64       0.23  True
torch_eager              cuda_B1_S128_H8_D64_R32        0.17  True
torch_eager              cuda_B1_S2048_H32_D128_R64     0.23  True
torch_eager              cuda_B1_S2048_H32_D64_R32      0.22  True
torch_eager              cuda_B1_S2048_H8_D128_R64      0.22  True
torch_eager              cuda_B1_S2048_H8_D64_R32       0.22  True
torch_eager              cuda_B1_S512_H32_D128_R64      0.22  True
torch_eager              cuda_B1_S512_H32_D64_R32       0.22  True
torch_eager              cuda_B1_S512_H8_D128_R64       0.22  True
torch_eager              cuda_B1_S512_H8_D64_R32        0.22  True
torch_eager              cuda_B2_S128_H32_D128_R64      0.22  True
torch_eager              cuda_B2_S128_H32_D64_R32       0.22  True
torch_eager              cuda_B2_S128_H8_D128_R64       0.22  True
torch_eager              cuda_B2_S128_H8_D64_R32        0.22  True
torch_eager              cuda_B2_S2048_H32_D128_R64     0.64  True
torch_eager              cuda_B2_S2048_H32_D64_R32      0.23  True
torch_eager              cuda_B2_S2048_H8_D128_R64      0.23  True
torch_eager              cuda_B2_S2048_H8_D64_R32       0.22  True
torch_eager              cuda_B2_S512_H32_D128_R64      0.22  True
torch_eager              cuda_B2_S512_H32_D64_R32       0.23  True
torch_eager              cuda_B2_S512_H8_D128_R64       0.22  True
torch_eager              cuda_B2_S512_H8_D64_R32        0.22  True
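
The summary lists the two implementations in separate blocks, which makes per-workload comparison awkward. The sketch below pairs them up and computes a speedup ratio (torch_eager p50 divided by hf_kernels_rotary p50) for a handful of rows copied from the table. `ROWS` and `speedups` are hypothetical names, and because the p50 values are rounded to two decimals the ratios are only approximate.

```python
# (impl, workload, p50_ms) rows copied from the summary table; a subset for illustration.
ROWS = [
    ("hf_kernels_rotary", "cuda_B1_S128_H8_D64_R32", 0.08),
    ("hf_kernels_rotary", "cuda_B1_S512_H8_D64_R32", 0.09),
    ("hf_kernels_rotary", "cuda_B1_S2048_H32_D128_R64", 0.26),
    ("hf_kernels_rotary", "cuda_B2_S2048_H32_D64_R32", 0.26),
    ("hf_kernels_rotary", "cuda_B2_S2048_H32_D128_R64", 0.85),
    ("torch_eager", "cuda_B1_S128_H8_D64_R32", 0.17),
    ("torch_eager", "cuda_B1_S512_H8_D64_R32", 0.22),
    ("torch_eager", "cuda_B1_S2048_H32_D128_R64", 0.23),
    ("torch_eager", "cuda_B2_S2048_H32_D64_R32", 0.23),
    ("torch_eager", "cuda_B2_S2048_H32_D128_R64", 0.64),
]


def speedups(rows, baseline="torch_eager", candidate="hf_kernels_rotary"):
    """Return {workload: baseline_p50 / candidate_p50} for workloads present in both."""
    by_workload = {}
    for impl, workload, p50_ms in rows:
        by_workload.setdefault(workload, {})[impl] = p50_ms
    return {
        workload: timings[baseline] / timings[candidate]
        for workload, timings in by_workload.items()
        if baseline in timings and candidate in timings
    }


if __name__ == "__main__":
    # Values above 1 mean the candidate is faster than the baseline.
    for workload, ratio in sorted(speedups(ROWS).items(), key=lambda kv: kv[1]):
        print(f"{workload:<28}{ratio:5.2f}x")
```

On these rows the ratio is above 2 for the small workloads and below 1 for the three largest ones listed, where hf_kernels_rotary reports a higher p50 than torch_eager.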

GENERATING COMBINED VISUALIZATION

Loaded 48 records
✓ Visualization saved as latency.svg
Saved latency.png
✓ SVG visualization ready!

ANALYSIS COMPLETE
Total implementations analyzed: 2

Implementations included:
  ✓ HF Kernels Rotary
  ✓ PyTorch Rotary

Artifacts:

latency.svg