# All Benchmarks Aggregated Report

## Layer Norm

### Layer Norm Latency

| Implementation | Description |
| --- | --- |
| HF Kernels Layer Norm | HuggingFace kernels implementation |
| PyTorch Layer Norm | PyTorch native implementation |
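
For context, here is a minimal sketch of what the PyTorch native baseline computes, using `torch.nn.functional.layer_norm`. The tensor shapes are illustrative assumptions; the report does not state the benchmarked configuration.

```python
import torch
import torch.nn.functional as F

# Assumed shapes for illustration only; the benchmark's actual sizes are not stated.
batch, seq_len, hidden = 4, 1024, 2048
x = torch.randn(batch, seq_len, hidden)
weight = torch.ones(hidden)
bias = torch.zeros(hidden)

# PyTorch native layer norm: normalize over the last (hidden) dimension.
y = F.layer_norm(x, normalized_shape=(hidden,), weight=weight, bias=bias, eps=1e-5)
```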

## Rotary Position Embeddings

### Rotary Position Embeddings Latency

| Implementation | Description |
| --- | --- |
| HF Kernels Rotary | HuggingFace kernels implementation |
| PyTorch Rotary | PyTorch native implementation |
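
As a rough reference for the PyTorch native baseline, below is a minimal rotate-half RoPE sketch. The base of 10000, all shapes, and the `rotate_half`/`apply_rope` helpers are assumptions for illustration, not the benchmark's code.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # Standard rotate-half formulation used by many PyTorch baselines.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Assumed shapes for illustration only.
batch, heads, seq_len, head_dim = 1, 8, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
t = torch.arange(seq_len).float()
freqs = torch.outer(t, inv_freq)         # (seq_len, head_dim // 2)
emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, head_dim)
cos, sin = emb.cos()[None, None], emb.sin()[None, None]

q_rot, k_rot = apply_rope(q, k, cos, sin)
```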

## Flash Attention

### Flash Attention Latency

| Implementation | Description |
| --- | --- |
| Flash Attention | Flash Attention implementation |
| HF Kernels Flash Attention | HuggingFace kernels Flash Attention |
| HF Kernels Flash Attention 3 | HuggingFace kernels Flash Attention 3 |
| Memory Efficient Attention | Memory-efficient attention implementation |
| Sage Attention | Sage attention implementation |
| xFormers | xFormers attention implementation |
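
One way to exercise flash and memory-efficient attention from native PyTorch is to pin the SDPA backend with `torch.nn.attention.sdpa_kernel` (PyTorch 2.3+, CUDA GPU required). The sketch below shows that mechanism under assumed shapes; it is not necessarily how this benchmark invokes each implementation.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Assumed shapes; fp16 (or bf16) on a CUDA GPU is typically required
# for the flash backend to be eligible.
batch, heads, seq_len, head_dim = 1, 8, 1024, 64
q, k, v = (torch.randn(batch, heads, seq_len, head_dim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

# Pin SDPA to the FlashAttention backend.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Same call pinned to the memory-efficient backend, for contrast.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out_mem_eff = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```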

## Causal Conv1D

### Causal Conv1D Latency

| Implementation | Description |
| --- | --- |
| HF Kernels Causal Conv1D | HuggingFace kernels implementation |
| PyTorch Causal Conv1D | PyTorch native implementation |
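
For context, a causal 1D convolution can be expressed in native PyTorch by left-padding the sequence with `kernel_size - 1` zeros before a depthwise `conv1d`, so position `t` never sees inputs past `t`. The depthwise layout and shapes below are assumptions; the report does not describe the benchmarked configuration.

```python
import torch
import torch.nn.functional as F

# Assumed shapes; causal-conv1d-style kernels operate on (batch, channels, seq_len).
batch, channels, seq_len, kernel_size = 2, 512, 1024, 4
x = torch.randn(batch, channels, seq_len)
weight = torch.randn(channels, 1, kernel_size)  # depthwise: one filter per channel

# Causal = left-pad by (kernel_size - 1) so the output at t depends only on <= t.
x_padded = F.pad(x, (kernel_size - 1, 0))
y = F.conv1d(x_padded, weight, groups=channels)
assert y.shape == x.shape
```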

## Activation

### Activation Latency

| Implementation | Description |
| --- | --- |
| HF Kernels SwiGLU | HuggingFace kernels SwiGLU implementation |
| PyTorch SwiGLU | PyTorch native SwiGLU implementation |
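
As a reference for the SwiGLU rows, here is a minimal native-PyTorch sketch of the common formulation `silu(gate) * up`. The gate/up split convention and the shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def swiglu(x: torch.Tensor) -> torch.Tensor:
    # SwiGLU: split the projected hidden states into a gate half and an up half,
    # apply SiLU to the gate, and multiply elementwise.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up

# Assumed shapes for illustration only.
x = torch.randn(4, 1024, 2 * 2048)
y = swiglu(x)  # (4, 1024, 2048)
```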

## ReLU

### ReLU Latency

| Implementation | Description |
| --- | --- |
| HF Kernels ReLU | HuggingFace kernels ReLU implementation |
| PyTorch ReLU | PyTorch native ReLU implementation |
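
The PyTorch native baseline here is a single `torch.relu` call. Since every table in this report is a latency comparison, the sketch below also shows one common CUDA-event timing pattern; the warmup and iteration counts, and the methodology itself, are assumptions rather than the report's stated procedure.

```python
import torch

# Assumed shape; requires a CUDA GPU.
x = torch.randn(4, 1024, 4096, device="cuda")

# PyTorch native ReLU baseline.
y = torch.relu(x)

# Minimal CUDA-event timing loop (one way such latencies are measured).
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
for _ in range(10):          # warmup
    torch.relu(x)
torch.cuda.synchronize()
start.record()
for _ in range(100):
    torch.relu(x)
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```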