Dataset Optimization, Speech-to-Text

Finding Mislabels in Hebrew STT: SASPEECH (RoboShaul)

By
Tomer Raviv
October 2, 2024
TL;DR
Hirundo analyzed the SASPEECH (also known as RoboShaul) dataset, a landmark Hebrew Speech-to-Text corpus covering nearly 30 hours of audio recordings and their corresponding transcripts by Israeli broadcaster Shaul Amsterdamski.
Utilizing our patent-pending Data Influence Engine, we uncovered errors (mislabels, or mis-transcriptions) in nearly 10% of the automatically-transcribed subset of the dataset and in 1.5% of the manually-transcribed subset. Our findings are publicly available to the Israeli data science community.

Introduction: SASPEECH (RoboShaul)

The SASPEECH dataset (commonly known as RoboShaul) features speech recordings paired with their corresponding sentences, all voiced by Israeli broadcaster Shaul Amsterdamski. Widely used by the machine learning community in Israel, it serves as a foundation for developing both text-to-speech (TTS) and speech-to-text (STT) systems. The dataset consists of both manually-tagged and automatically-tagged segments. However, like many datasets, it is not free from labeling errors (mis-transcriptions), which can occur during both manual and automatic annotation processes.

In this post, we leverage Hirundo's patent-pending Data Influence Engine to identify and correct some of these mislabeled entries.

  • Manually-transcribed segments: In this portion of the dataset, we identified errors in 1.5% of the audio segments and corrected them. The corrected subset is available as a resource for the community at the end of this post.
  • Automatically-transcribed segments: In this subset, we identified errors in nearly 10% of the segments. Given the size of this portion and the volume of findings, we did not correct these segments ourselves; instead, the IDs of the flagged segments are provided for further investigation at the end of this post.

The Dataset

The dataset is divided into two subsets: a gold-standard set of about 4 hours with manually-tagged transcriptions, and an automatically-tagged set of about 24 hours with machine-generated transcriptions that underwent basic alignment. Both subsets consist of pairs of short audio clips (under 30 seconds) and their corresponding text transcriptions.
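
For readers who want to work with the data directly, here is a minimal loading sketch. It assumes, hypothetically, that each subset ships as a directory of WAV clips plus a metadata CSV with "file_id" and "transcript" columns; adjust the paths and column names to the actual release.

    from pathlib import Path
    import csv

    def load_pairs(root):
        """Yield (wav_path, transcript) pairs from one subset directory."""
        root = Path(root)
        with open(root / "metadata.csv", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield root / "wavs" / f"{row['file_id']}.wav", row["transcript"]

    # Usage: iterate over the gold (manually-tagged) subset.
    for wav_path, text in load_pairs("saspeech_gold"):
        print(wav_path.name, text)
        break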

For instance, the first few audio clips in the tagged data carry transcriptions such as:

  • שָׁלוֹם, צְלִיל אַבְרָהָם. ("Hello, Tzlil Avraham.")
  • לְגַמְרֵי, מַדְהִים, לֹא ("Totally, amazing, no")
  • וְדַוְוקָא בִּגְלַל שֶׁכּוּלָּנוּ הָיִינוּ עֲסוּקִים בַּמִּלְחָמָה, הַפּוֹדְקָאסְט הַזֶּה שֶׁלָּנוּ הוּא הִזְדַּמְּנוּת לְהַשְׁלִים פְּעָרִים שֶׁל מָה שֶׁקָּרָה בַּמִּשְׁפָּט בַּשְּׁבוּעַיִים הָאַחֲרוֹנִים. ("And precisely because we were all preoccupied with the war, this podcast of ours is an opportunity to catch up on what happened in the trial over the last two weeks.")

Finding & Fixing Mislabels in SASPEECH with Hirundo’s Engine

This dataset is one of the first large-scale open-source Hebrew datasets available for training TTS and STT models, aimed at advancing speech technology for the low-resource Hebrew language. Training on mislabeled data can noticeably degrade model performance, making accurate labeling crucial for high-quality results.

At Hirundo, a significant portion of our efforts is focused on enhancing AI datasets. Our platform leverages our patent-pending Data Influence Engine, which tracks the influence of individual data samples throughout the training process of machine learning models. This technique allows us to detect inaccuracies with a precision that outperforms traditional dataset-improvement methods such as statistical analysis, consensus labeling, and the use of external pre-trained models.
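
Our Data Influence Engine is proprietary, so its internals are not shown here. As a loose, hypothetical illustration of the broader training-dynamics family of mislabel detectors, the sketch below records each sample's loss at every epoch and flags the samples whose loss stays high once training has mostly converged; the per_sample_losses array and the flagged fraction are assumptions made for the example.

    import numpy as np

    def flag_suspects(per_sample_losses, fraction=0.015):
        """Flag likely-mislabeled samples from training dynamics.

        per_sample_losses: array of shape (n_epochs, n_samples) holding each
        sample's training loss at every epoch. Samples whose loss remains
        high late in training are disproportionately likely to be mislabeled.
        """
        late = per_sample_losses[-3:].mean(axis=0)   # mean loss over the last 3 epochs
        n_flag = max(1, int(fraction * late.size))   # e.g. flag ~1.5% of the subset
        return np.argsort(late)[::-1][:n_flag]       # sample indices, highest loss first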

Using our approach, we identified a notable number of mislabeled entries in the RoboShaul dataset. While the majority of errors were found in the automatically-transcribed part, we also discovered mislabeled entries in the manual subset, which contains almost 3,000 audio-sentence pairs amounting to 4 hours of speech; errors occur in ~1.5% of the manually-tagged sentences. Below, we show a few mislabeled examples.

Original transcript: לָמָּה בְּסּוֹפֵר תָּמִיד יֵשׁ פִּירָמִידוֹת יָפוֹת מְסוּדָּרוֹת… אָהַמ… ("Why does the supermarket always have nice, orderly pyramids… uhm…")

But what is the error? Two words are swapped between the text and the audio!

Fixed transcript: לָמָּה בְּסּוֹפֵר תָּמִיד יֵשׁ פִּירָמִידוֹת מְסוּדָּרוֹת יָפוֹת… אָהַמ… ("Why does the supermarket always have orderly, nice pyramids… uhm…")

Original transcript: "אִם לְמַדְתֶּם בְּמַעֲרֶכֶת הַחִינּוּךְ הַיִּשְׂרְאֵלִית בִּשְׁנוֹת הַ-… אָהַמ…" ("If you studied in the Israeli education system in the years of the-… uhm…")

But what is the error? The word הַיִּשְׂרְאֵלִית ("Israeli") appears in the text but is not said in the audio!

Fixed transcript: "אִם לְמַדְתֶּם בְּמַעֲרֶכֶת הַחִינּוּךְ בִּשְׁנוֹת הַ-… אָהַמ…" ("If you studied in the education system in the years of the-… uhm…")

Original transcript: זֶה חִלְחֵל גַּם לִכְלָל, זֶה חִלְחֵל גַּם לַחֲבָרוֹת הַגְּדוֹלוֹת יוֹתֵר, כְּלָל בִּיטּוּחַ, וּמִיגְדֵּל, וְאַיְּילוֹן. ("This trickled down to Clal too, this trickled down to the bigger companies too, Clal Insurance, and Migdal, and Ayalon.")

This one is easy: the text contains a few extra words that are not spoken in the audio.

Fixed transcript: זֶה חִלְחֵל גַּם לַחֲבָרוֹת הַגְּדוֹלוֹת יוֹתֵר, כְּלָל בִּיטּוּחַ, וּמִיגְדֵּל, וְאַיְּילוֹן. ("This trickled down to the bigger companies too, Clal Insurance, and Migdal, and Ayalon.")

We manually reviewed and corrected all the flagged examples from the manually-tagged subset, providing the corrected sentences as a contribution to the community. For clarity, we annotated the types of errors—substitution, deletion, or insertion—on both the original and corrected sentences.
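
As a rough illustration of how such error types can be derived automatically, the sketch below word-aligns an original and a corrected transcript using Python's standard difflib; this is illustrative tooling, not Hirundo's actual pipeline.

    import difflib

    def classify_edits(original, corrected):
        """Word-align two transcripts and label each difference."""
        orig, corr = original.split(), corrected.split()
        matcher = difflib.SequenceMatcher(a=orig, b=corr, autojunk=False)
        edits = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                edits.append(("substitution", orig[i1:i2], corr[j1:j2]))
            elif tag == "delete":
                edits.append(("deletion", orig[i1:i2], []))
            elif tag == "insert":
                edits.append(("insertion", [], corr[j1:j2]))
        return edits

Note that a swapped adjacent word pair, as in the supermarket example above, typically surfaces as an insertion plus a deletion of the same word rather than as a single substitution.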

In contrast, the automatically-tagged subset exhibited a much higher rate of errors. Our analysis flagged nearly 10% of the speech-sentence pairs as containing at least one error, such as a substitution, deletion, or insertion. We have gathered the file IDs of these suspected mismatches into a CSV file for further examination. Both the corrected manual subset and the flagged automatic subset are available for download below and on our Hugging Face page.

Manual Subset: Errors and Corrections (XLSX)

Automated Subset: Errors (CSV)
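
For anyone training on the automatic subset, here is a minimal sketch of how the flagged IDs might be used to filter the data before training. The CSV filename and the "file_id" column name are assumptions; check the actual release for the exact header.

    import csv
    from pathlib import Path

    # Load the flagged segment IDs from the released CSV.
    with open("automated_subset_errors.csv", encoding="utf-8") as f:
        flagged = {row["file_id"] for row in csv.DictReader(f)}

    # Keep only the automatic-subset clips that were not flagged.
    wav_paths = sorted(Path("saspeech_automatic/wavs").glob("*.wav"))
    clean = [p for p in wav_paths if p.stem not in flagged]
    print(f"kept {len(clean)} of {len(wav_paths)} clips")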

Conclusions

Cleaning task-specific datasets is especially crucial for low-resource languages like Hebrew, as it directly impacts the development of effective text-to-speech and speech-to-text engines.

In this post, we explored how Hirundo’s data influence engine was used to identify mislabeled entries in the SASPEECH dataset. We corrected the errors in the manually tagged data and flagged potential mislabeled entries in the automatically generated transcriptions. These refined resources are now available to the community to further advance Hebrew TTS and STT technologies.

Tomer Raviv
Senior Deep Learning Researcher, Hirundo
