Jam-ALT is an automatic lyrics transcription (ALT) benchmark based on the JamendoLyrics dataset.

The lyrics have been revised according to a newly compiled annotation guide, which unifies the music industry’s lyrics transcription and formatting guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds.

This page visualizes the differences between the original JamendoLyrics dataset and our revision.

Please note that the dataset is not time-aligned, as the revised lyrics do not map easily to the timestamps from JamendoLyrics. To evaluate automatic lyrics alignment (ALA), please use JamendoLyrics directly.

In addition to the classical word error rate, the benchmark includes metrics that take into account letter case, punctuation, and line/section breaks.
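
To see why this matters, here is a minimal standalone sketch (using the jiwer library, not alt-eval's own implementation) showing how removing case and punctuation normalization changes the error rate:

import jiwer

reference = "Oh, baby!"
hypothesis = "oh baby"

# Conventional ALT evaluation lowercases and strips punctuation before scoring:
normalize = jiwer.Compose([jiwer.ToLowerCase(), jiwer.RemovePunctuation()])
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0
# Without normalization, case and punctuation mistakes count as word errors:
print(jiwer.wer(reference, hypothesis))  # 1.0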

The benchmark is described in our forthcoming ISMIR 2024 paper, "Lyrics Transcription for Humans: A Readability-Aware Benchmark" (an earlier version appeared as the ISMIR 2023 late-breaking demo "Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark").

Running the benchmark

The dataset can be loaded easily using the Hugging Face datasets library, and the evaluation is implemented in our alt-eval package:

from datasets import load_dataset
from alt_eval import compute_metrics

dataset = load_dataset("audioshake/jam-alt", trust_remote_code=True, revision="v1.0.0")["test"]
# transcriptions: list[str] -- your system's outputs, one per song, in dataset order
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
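
Assuming, as in the alt-eval examples, that compute_metrics returns a plain dictionary of metric values (the exact key set depends on the alt-eval version), a quick way to inspect the results is:

metrics = compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
for name, value in metrics.items():
    print(name, value)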

By default, the dataset includes the audio, allowing you to run transcription directly. For example, the following code can be used to evaluate Whisper:

import datasets
import whisper

dataset = load_dataset("audioshake/jam-alt", trust_remote_code=True, revision="v1.0.0")["test"]
dataset = dataset.cast_column("audio", datasets.Audio(decode=False))  # Get the raw audio file; let Whisper decode it

model = whisper.load_model("tiny")
transcriptions = [
    "\n".join(s["text"].strip() for s in model.transcribe(a["path"])["segments"])
    for a in dataset["audio"]
]
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
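
Since transcription is by far the slowest step, it may be worth caching its outputs so the evaluation can be re-run without re-transcribing the audio; a simple sketch (the file name is arbitrary):

import json

# Save the transcriptions, one string per song, in dataset order.
with open("transcriptions.json", "w") as f:
    json.dump(transcriptions, f)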

Alternatively, if you already have transcriptions, you might prefer to skip loading the audio:

dataset = load_dataset("audioshake/jam-alt", trust_remote_code=True, revision="v1.0.0", with_audio=False)["test"]
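
You can then evaluate pre-computed transcriptions against this audio-free copy of the dataset, e.g. continuing from the (hypothetical) JSON cache sketched above:

import json

from alt_eval import compute_metrics

# Load previously saved transcriptions -- one string per song, in dataset order.
with open("transcriptions.json") as f:
    transcriptions = json.load(f)

compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])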

Citation

When using the benchmark, please cite our ISMIR 2024 paper as well as the original JamendoLyrics paper:

@inproceedings{cifka-2024-jam-alt,
  author       = {Ond\v{r}ej C\'ifka and
                  Hendrik Schreiber and
                  Luke Miner and
                  Fabian-Robert St\"oter},
  title        = {Lyrics Transcription for Humans: A Readability-Aware Benchmark},
  booktitle    = {Proceedings of the 25th International Society for 
                  Music Information Retrieval Conference},
  year         = 2024,
  publisher    = {ISMIR},
  note         = {to appear; preprint arXiv:2408.06370}
}

@inproceedings{durand-2023-contrastive,
  author       = {Durand, Simon and Stoller, Daniel and Ewert, Sebastian},
  title        = {Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages},
  booktitle    = {2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year         = 2023,
  pages        = {1--5},
  address      = {Rhodes Island, Greece},
  doi          = {10.1109/ICASSP49357.2023.10096725}
}