Jam-ALT is an automatic lyrics transcription (ALT) benchmark, based on the JamendoLyrics dataset.

The lyrics have been revised according to a newly compiled annotation guide, which unifies the music industry’s lyrics transcription and formatting guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds.

This page visualizes the differences between the original JamendoLyrics dataset and our revision.

While the dataset does contain line-level timings, it is not time-aligned at the word level. To evaluate automatic lyrics alignment (ALA), please use JamendoLyrics, which is the standard benchmark for that task.

Apart from the classic word error rate, the benchmark includes metrics that take into account letter case, punctuation, and line/section breaks.
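To see why these aspects matter, here is a minimal, self-contained illustration (not the alt-eval implementation) of how a case- and punctuation-sensitive word error rate diverges from a conventionally normalized one:

```python
import re

def wer(ref_words, hyp_words):
    """Word error rate: Levenshtein distance over word lists, divided by reference length."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref_words)

def normalize(text):
    """Lowercase and drop punctuation, as a conventional ALT evaluation would."""
    return re.findall(r"[a-z']+", text.lower())

ref = "Oh, baby! Don't you cry"
hyp = "oh baby don't you cry"

print(wer(ref.split(), hyp.split()))        # 0.6 -- case and punctuation count as errors
print(wer(normalize(ref), normalize(hyp)))  # 0.0 -- identical after normalization
```

Under normalization the two strings are indistinguishable, while the case- and punctuation-aware comparison penalizes every formatting difference.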

The benchmark is described in our ISMIR 2024 paper Lyrics Transcription for Humans: A Readability-Aware Benchmark. Line-level timings were added in the ICME 2025 workshop paper Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper.

Running the benchmark

The dataset can easily be loaded using the Hugging Face datasets library, and the evaluation is implemented in our alt-eval package:

from datasets import load_dataset
from alt_eval import compute_metrics

dataset = load_dataset("jamendolyrics/jam-alt", revision="v1.4.0", split="test")
# `transcriptions` is a list[str] of system outputs, one per song
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])

By default, the dataset includes the audio, allowing you to run transcription directly. For example, the following code can be used to evaluate Whisper:

import datasets
import whisper
from alt_eval import compute_metrics
from datasets import load_dataset

dataset = load_dataset("jamendolyrics/jam-alt", revision="v1.4.0", split="test")
dataset = dataset.cast_column("audio", datasets.Audio(decode=False))  # get the raw audio file and let Whisper decode it

model = whisper.load_model("tiny")
transcriptions = [
  "\n".join(s["text"].strip() for s in model.transcribe(a["path"])["segments"])
  for a in dataset["audio"]
]
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])

Alternatively, if you already have transcriptions, you might prefer to skip loading the audio:

dataset = load_dataset("jamendolyrics/jam-alt", revision="v1.4.0", split="test").remove_columns("audio")

Line-level dataset

We also provide a line-level version of the dataset, named Jam-ALT Lines. It can be loaded as follows:

load_dataset("jamendolyrics/jam-alt-lines", split="test")

See the dataset page for more information.

Citation

When using the benchmark, please cite our ISMIR 2024 paper as well as the original JamendoLyrics paper. For the line-level timings, please cite the ICME workshop paper.

@inproceedings{cifka-2024-jam-alt,
  author       = {Ond\v{r}ej C\'ifka and
                  Hendrik Schreiber and
                  Luke Miner and
                  Fabian-Robert St\"oter},
  title        = {Lyrics Transcription for Humans: A Readability-Aware Benchmark},
  booktitle    = {Proceedings of the 25th International Society for
                  Music Information Retrieval Conference},
  year         = {2024},
  publisher    = {ISMIR},
  note         = {preprint arXiv:2408.06370}
}
@inproceedings{durand-2023-contrastive,
  author       = {Durand, Simon and Stoller, Daniel and Ewert, Sebastian},
  title        = {Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages},
  booktitle    = {2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year         = {2023},
  pages        = {1--5},
  address      = {Rhodes Island, Greece},
  doi          = {10.1109/ICASSP49357.2023.10096725}
}
@inproceedings{syed-2025-mss-alt,
  author       = {Jaza Syed and
                  Ivan Meresman-Higgs and
                  Ond{\v{r}}ej C{\'{\i}}fka and
                  Mark Sandler},
  title        = {Exploiting Music Source Separation for Automatic Lyrics Transcription with {Whisper}},
  booktitle    = {2025 {IEEE} International Conference on Multimedia and Expo Workshops (ICMEW)},
  publisher    = {IEEE},
  year         = {2025}
}