Jam-ALT is an automatic lyrics transcription (ALT) benchmark, based on the JamendoLyrics dataset.
The lyrics have been revised according to a newly compiled annotation guide, which unifies the music industry’s lyrics transcription and formatting guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds.
This page visualizes the differences between the original JamendoLyrics dataset and our revision.
Please note that the dataset is not time-aligned, since the revised lyrics do not easily map to the timestamps from JamendoLyrics. To evaluate automatic lyrics alignment (ALA), please use JamendoLyrics directly.
In addition to the standard word error rate, the benchmark includes metrics that take letter case, punctuation, and line/section breaks into account.
The benchmark is described in our forthcoming ISMIR 2024 paper "Lyrics Transcription for Humans: A Readability-Aware Benchmark" (an earlier version appeared as the ISMIR 2023 late-breaking demo "Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark").
Running the benchmark
The dataset can be loaded easily using the Hugging Face datasets library, and the evaluation is implemented in our alt-eval package:
from datasets import load_dataset
from alt_eval import compute_metrics
dataset = load_dataset("audioshake/jam-alt", trust_remote_code=True, revision="v1.0.0")["test"]
# transcriptions: list[str] (one transcription per song, produced by your system)
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
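The call returns the computed scores as a plain Python mapping (the exact metric names are those defined by the alt-eval package), so you can inspect or log the results directly. A minimal sketch:

results = compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
for name, value in results.items():  # e.g. word error rate and the case/punctuation-aware metrics
    print(f"{name}: {value}")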
By default, the dataset includes the audio, allowing you to run transcription directly. For example, the following code can be used to evaluate Whisper:
import datasets
import whisper

dataset = load_dataset("audioshake/jam-alt", trust_remote_code=True, revision="v1.0.0")["test"]
# Keep the raw audio files and let Whisper do the decoding
dataset = dataset.cast_column("audio", datasets.Audio(decode=False))

model = whisper.load_model("tiny")
transcriptions = [
"\n".join(s["text"].strip() for s in model.transcribe(a["path"])["segments"])
for a in dataset["audio"]
]
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])
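Because compute_metrics takes the per-song languages, it is also straightforward to slice the evaluation by language. The grouping below is our own helper logic, not part of alt-eval:

from collections import defaultdict

# Group reference lyrics and transcriptions by song language
by_language = defaultdict(lambda: ([], []))
for reference, transcription, language in zip(
    dataset["text"], transcriptions, dataset["language"]
):
    by_language[language][0].append(reference)
    by_language[language][1].append(transcription)

# Report the metrics separately for each language
for language, (references, hypotheses) in by_language.items():
    print(language, compute_metrics(references, hypotheses, languages=[language] * len(references)))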
Alternatively, if you already have transcriptions, you might prefer to skip loading the audio:
dataset = load_dataset("audioshake/jam-alt", trust_remote_code=True, revision="v1.0.0", with_audio=False)["test"]
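For example, if your transcriptions are stored as one text file per song (a hypothetical layout; we assume here that the dataset's name field identifies each song and that the files live in a transcriptions/ directory), the evaluation could look like this:

from pathlib import Path

# Hypothetical layout: one UTF-8 text file per song, named after the
# dataset's `name` field, e.g. transcriptions/<name>.txt
transcriptions = [
    Path("transcriptions", f"{name}.txt").read_text(encoding="utf-8")
    for name in dataset["name"]
]
compute_metrics(dataset["text"], transcriptions, languages=dataset["language"])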
Citation
When using the benchmark, please cite our ISMIR 2024 paper as well as the original JamendoLyrics paper:
@inproceedings{cifka-2024-jam-alt,
  author    = {Ond\v{r}ej C\'ifka and
               Hendrik Schreiber and
               Luke Miner and
               Fabian-Robert St\"oter},
  title     = {Lyrics Transcription for Humans: A Readability-Aware Benchmark},
  booktitle = {Proceedings of the 25th International Society for
               Music Information Retrieval Conference},
  year      = {2024},
  publisher = {ISMIR},
  note      = {to appear; preprint arXiv:2408.06370}
}
@inproceedings{durand-2023-contrastive,
  author    = {Durand, Simon and Stoller, Daniel and Ewert, Sebastian},
  title     = {Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages},
  booktitle = {2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2023},
  pages     = {1--5},
  address   = {Rhodes Island, Greece},
  doi       = {10.1109/ICASSP49357.2023.10096725}
}