
Models

Jabberjay bundles seven model families. Each is downloaded from Hugging Face Hub on first use and cached locally.


Choosing a model

| Model     | Type                          | Datasets                               | Requires               |
|-----------|-------------------------------|----------------------------------------|------------------------|
| VIT       | Vision Transformer            | ASVspoof2019, ASVspoof5, VoxCelebSpoof | dataset, visualisation |
| AST       | Audio Spectrogram Transformer | ASVspoof2019, ASVspoof5, VoxCelebSpoof | dataset (optional)     |
| Wav2Vec2  | Self-supervised transformer   | ASVspoof2019                           |                        |
| HuBERT    | Self-supervised transformer   | In-The-Wild                            |                        |
| WavLM     | Self-supervised transformer   | Mixed deepfake                         |                        |
| RawNet2   | End-to-end CNN                | ASVspoof 2021                          |                        |
| Classical | KNN classifier                | ASVspoof2019                           |                        |

Simple rule of thumb:

  • For a quick, general-purpose result — use WavLM or HuBERT
  • For the lowest error rate on In-The-Wild audio — use HuBERT (EER 1.43%)
  • For a lightweight baseline with no deep learning — use Classical
  • To sweep all models and compare — use jj.load() once, then call jj.detect() for each
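
The sweep in the last bullet can be sketched as follows. This is a hypothetical helper, not part of the library: `detect` stands in for jj.detect, and the result is assumed to behave like a mapping with a confidence field, as the notes further down describe.

```python
# Hypothetical sweep helper; `detect` stands in for jj.detect and is assumed
# to return a mapping with a "confidence" key.
MODELS = ["VIT", "AST", "Wav2Vec2", "HuBERT", "WavLM", "RawNet2", "Classical"]

# VIT needs dataset/visualisation arguments (see its section below); the
# values here are just one of its nine variants.
EXTRA_ARGS = {"VIT": {"dataset": "VoxCelebSpoof", "visualisation": "ConstantQ"}}

def sweep(detect, path):
    """Run one file through every bundled model, ranked by confidence."""
    results = {m: detect(path, model=m, **EXTRA_ARGS.get(m, {})) for m in MODELS}
    return sorted(results.items(), key=lambda kv: kv[1]["confidence"], reverse=True)
```

The ranked list makes disagreements between models easy to spot at a glance.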

VIT

Vision Transformer classifiers that convert the audio to a 2D image (spectrogram) and classify it visually.

Nine variants, covering three visualisation types × three training datasets:

| Visualisation  | ASVspoof2019 | ASVspoof5 | VoxCelebSpoof |
|----------------|--------------|-----------|---------------|
| ConstantQ      | ✓            | ✓         | ✓             |
| MelSpectrogram | ✓            | ✓         | ✓             |
| MFCC           | ✓            | ✓         | ✓             |

jj.detect("audio.wav", model="VIT", dataset="VoxCelebSpoof", visualisation="ConstantQ")
jj.detect("audio.wav", model="VIT", dataset="ASVspoof2019", visualisation="MFCC")
jj.detect("audio.wav", model="VIT", dataset="ASVspoof5", visualisation="MelSpectrogram")

Visualisations:

  • ConstantQ — Constant-Q transform; good frequency resolution across the full spectrum
  • MelSpectrogram — Mel-scaled spectrogram; perceptually motivated, widely used in speech
  • MFCC — Mel-frequency cepstral coefficients; compact and speech-focused
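
Since every visualisation is paired with every training dataset, the nine variants can be enumerated mechanically; a minimal sketch:

```python
from itertools import product

DATASETS = ["ASVspoof2019", "ASVspoof5", "VoxCelebSpoof"]
VISUALISATIONS = ["ConstantQ", "MelSpectrogram", "MFCC"]

# Three datasets x three visualisations = the nine VIT variants.
VIT_VARIANTS = [
    {"model": "VIT", "dataset": d, "visualisation": v}
    for d, v in product(DATASETS, VISUALISATIONS)
]
```

Each dict can then be splatted into a call, e.g. jj.detect("audio.wav", **VIT_VARIANTS[0]).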

AST

Audio Spectrogram Transformer. Applies a transformer directly to a patch-based spectrogram without a CNN backbone.

Available for ASVspoof2019, ASVspoof5, and VoxCelebSpoof datasets.

jj.detect("audio.wav", model="AST")  # defaults to VoxCelebSpoof
jj.detect("audio.wav", model="AST", dataset="VoxCelebSpoof")
jj.detect("audio.wav", model="AST", dataset="ASVspoof2019")

Note

dataset is optional for AST and defaults to VoxCelebSpoof if not specified.


Wav2Vec2

Gustking/wav2vec2-large-xlsr-deepfake-audio-classification

Wav2Vec2-XLSR-300M fine-tuned on ASVspoof2019. EER 4.01%.

jj.detect("audio.wav", model="Wav2Vec2")
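
EER (equal error rate), quoted here and for HuBERT below, is the operating point where the rate of spoofed audio accepted as genuine equals the rate of genuine audio rejected; lower is better. A minimal self-contained illustration over raw score lists — this is not part of the Jabberjay API:

```python
def eer(bonafide_scores, spoof_scores):
    """Equal error rate: scan candidate thresholds, find where the
    false-acceptance rate (spoof scored at or above the threshold) and the
    false-rejection rate (bonafide scored below it) are closest, and
    return their mean at that point."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if best_gap is None or abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

Perfectly separated scores give an EER of 0; heavily overlapping scores push it toward 0.5 (chance level).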

HuBERT

abhishtagatya/hubert-base-960h-itw-deepfake

HuBERT-base fine-tuned on the In-The-Wild dataset. EER 1.43% — the lowest of all bundled models on real-world audio.

jj.detect("audio.wav", model="HuBERT")

WavLM

DavidCombei/wavLM-base-Deepfake_V2

WavLM-base fine-tuned on a mixed deepfake dataset.

jj.detect("audio.wav", model="WavLM")

RawNet2

End-to-end anti-spoofing network via rawnet2-antispoofing, with weights from ASVspoof 2021. Operates directly on raw waveforms — no feature extraction step.

jj.detect("audio.wav", model="RawNet2")

Note

scores is None for RawNet2 — only label, is_bonafide, and confidence are populated.


Classical

Feature-based KNN classifier trained on ASVspoof2019. Extracts hand-crafted audio features (MFCCs, spectral features) and classifies with k-nearest neighbours. Fast and dependency-light compared to the transformer models.

jj.detect("audio.wav", model="Classical")
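
For illustration only (not Jabberjay's actual implementation), the classification step amounts to a majority vote among the k closest training feature vectors:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature_vector.
    Vote among the k nearest training examples by Euclidean distance."""
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In the real pipeline the feature vectors would be the extracted MFCC and spectral features, and the labels the bonafide/spoof ground truth from ASVspoof2019.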

Note

scores is None for Classical — only label, is_bonafide, and confidence are populated.
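
Because scores is absent for both RawNet2 and Classical, downstream code should guard for it. A hypothetical formatting helper, with field names taken from the notes above:

```python
def summarize(result):
    """result is assumed to behave like a mapping with the fields named in
    the notes above: label, is_bonafide, confidence, and scores."""
    line = f"{result['label']} (confidence {result['confidence']:.2f})"
    if result.get("scores") is not None:  # None for RawNet2 and Classical
        line += f", scores: {result['scores']}"
    return line
```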