Jam-ALT is an automatic lyrics transcription (ALT) benchmark, based on the JamendoLyrics dataset.
The lyrics have been revised according to a newly compiled annotation guide, which unifies the music industry’s lyrics transcription and formatting guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds.
This page visualizes the differences between the original JamendoLyrics dataset and our revision.
While the dataset does contain line-level timings, it is not time-aligned at the word level. To evaluate automatic lyrics alignment (ALA), please use JamendoLyrics, which is the standard benchmark for that task.
In addition to the standard word error rate (WER), the benchmark includes metrics that account for letter case, punctuation, and line/section breaks.
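To make the word error rate mentioned above concrete, here is a minimal, self-contained sketch of how it is conventionally computed: the word-level edit distance divided by the number of reference words. This is an illustration only, not the alt-eval implementation. The function names `edit_distance` and `wer`, the whitespace tokenization, and the `case_sensitive` flag are assumptions made for this example, and punctuation and line breaks are not handled here.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 1) or match (cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, case_sensitive=False):
    """Edit distance divided by the number of reference words.

    Illustrative only: tokens are split on whitespace, which is an
    assumption of this sketch and not necessarily what alt-eval does.
    """
    if not case_sensitive:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


reference = "Hello darkness my old friend"
hypothesis = "hello darkness my old friends"
print(wer(reference, hypothesis))                       # 0.2
print(wer(reference, hypothesis, case_sensitive=True))  # 0.4
```

In this toy example, ignoring case leaves one substitution out of five reference words, while the case-sensitive variant also counts "Hello" vs. "hello" as an error. This is the kind of difference a case-aware metric is meant to expose.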
The benchmark is described in our ISMIR 2024 paper Lyrics Transcription for Humans: A Readability-Aware Benchmark. Line-level timings were added in the ICME 2025 workshop paper Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper.
Running the benchmark
The dataset can be loaded easily using Hugging Face datasets, and the evaluation is implemented in our alt-eval package:
from datasets import load_dataset
from alt_eval import compute_metrics
dataset = load_dataset("jamendolyrics/jam-alt", revision="v1.4.0", split="test")
# transcriptions: list[str], one predicted transcript per song, in the same order as the dataset
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
By default, the dataset includes the audio, allowing you to run transcription directly. For example, the following code can be used to evaluate Whisper:
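import datasets
import whisper
from alt_eval import compute_metrics
from datasets import load_dataset

dataset = load_dataset("jamendolyrics/jam-alt", revision="v1.4.0", split="test")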
dataset = dataset.cast_column("audio", datasets.Audio(decode=False)) # Get the raw audio file, let Whisper decode it
model = whisper.load_model("tiny")
transcriptions = [
"\n".join(s["text"].strip() for s in model.transcribe(a["path"])["segments"])
for a in dataset["audio"]
]
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
Alternatively, if you already have transcriptions, you might prefer to skip loading the audio:
dataset = load_dataset("jamendolyrics/jam-alt", revision="v1.4.0", split="test").remove_columns("audio")
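If your transcriptions are stored on disk, they need to be passed to the evaluation in the same order as the dataset rows. The following is a minimal sketch of one way to do that, assuming one UTF-8 text file per song named `<song name>.txt`. The helper `load_transcriptions` and the file layout are hypothetical and chosen only for this example.

```python
from pathlib import Path


def load_transcriptions(names, directory):
    """Read one UTF-8 text file per song, preserving the order of `names`.

    Hypothetical helper: assumes each transcript is stored as
    `<directory>/<name>.txt`.
    """
    directory = Path(directory)
    transcriptions = []
    for name in names:
        path = directory / f"{name}.txt"
        if not path.is_file():
            # Fail loudly rather than silently misaligning transcripts and references
            raise FileNotFoundError(f"No transcription found for {name!r}: {path}")
        transcriptions.append(path.read_text(encoding="utf-8").strip())
    return transcriptions
```

Assuming the dataset exposes a per-song identifier column (called `name` here, which is an assumption), the result of `load_transcriptions(dataset["name"], "my_transcripts")` could then be passed to `compute_metrics` as in the examples above.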
Line-level dataset
We also provide a line-level version of the dataset, named Jam-ALT Lines. It can be loaded as follows:
load_dataset("jamendolyrics/jam-alt-lines", split="test")
See the dataset page for more information.
Citation
When using the benchmark, please cite our ISMIR 2024 paper as well as the original JamendoLyrics paper. For the line-level timings, please cite the ICME workshop paper.
@inproceedings{cifka-2024-jam-alt,
author = {Ond{\v{r}}ej C{\'{\i}}fka and
Hendrik Schreiber and
Luke Miner and
Fabian-Robert St\"oter},
title = {Lyrics Transcription for Humans: A Readability-Aware Benchmark},
booktitle = {Proceedings of the 25th International Society for
Music Information Retrieval Conference},
year = 2024,
publisher = {ISMIR},
note = {to appear; preprint arXiv:2408.06370}
}
@inproceedings{durand-2023-contrastive,
author={Durand, Simon and Stoller, Daniel and Ewert, Sebastian},
booktitle={2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages},
year={2023},
pages={1--5},
address={Rhodes Island, Greece},
doi={10.1109/ICASSP49357.2023.10096725}
}
@inproceedings{syed-2025-mss-alt,
author = {Jaza Syed and
Ivan Meresman-Higgs and
Ond{\v{r}}ej C{\'{\i}}fka and
Mark Sandler},
title = {Exploiting Music Source Separation for Automatic Lyrics Transcription with {Whisper}},
booktitle = {2025 {IEEE} International Conference on Multimedia and Expo Workshops (ICMEW)},
publisher = {IEEE},
year = {2025}
}