1 Introduction

Speech recognition has become a critical component in numerous applications, ranging from virtual assistants and transcription services to voice-controlled devices and accessibility tools. The increasing reliance on speech recognition machine learning models necessitates robust and comprehensive evaluation methodologies to ensure their performance, reliability, and adaptability across diverse scenarios.

Existing evaluations of speech recognition models often rely on curated datasets, such as LibriSpeech [25], CommonVoice [4], and TIMIT [32]. While these datasets provide a controlled environment for evaluation, they may not capture the full spectrum of real-world scenarios, potentially limiting a model’s generalizability. Additionally, these datasets may not be updated frequently, resulting in potential stagnation in performance evaluation.

In this article, we introduce Mi-Go (the name is explained later), a tool designed to evaluate the prediction performance of general-purpose speech recognition machine learning models. Mi-Go harnesses the power of YouTube as a data source, providing access to a virtually unlimited repository of diverse audio-visual content. YouTube offers a rich and continuously updated collection of spoken language data, encompassing various languages, accents, dialects, speaking styles, and audio quality levels. This makes it an ideal data source for evaluating the adaptability and performance of speech recognition models in real-world situations.

In recent years, there has been a growing interest in harnessing the vast amount of data available on platforms such as YouTube for machine learning tasks. Various approaches have been proposed to collect and process data from YouTube, including YouTube-8M [1], AudioSet [11], and GigaSpeech [6]. However, these methods primarily focus on video and audio classification tasks rather than the evaluation of speech recognition models.

The landscape of speech recognition technology has witnessed a paradigm shift, driven by rapid advancements in deep learning and artificial intelligence. Groundbreaking architectures, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and, more recently, transformer-based models, have revolutionized this domain, offering unprecedented accuracy in transcribing human speech. These models, trained on vast datasets, have demonstrated remarkable proficiency in navigating the complexities of language, including accents, dialects, and noise interference. The emergence of these models not only underscores the accelerated pace of development in this field but also leads one to believe that in the near future seamless human-computer interaction will become the norm. It should be noted that while these advancements present exciting prospects, they also raise compelling questions concerning data privacy, algorithmic bias, and the digital divide.

In our study, we address this need by proposing, and then empirically investigating, an evaluation tool for the prediction performance of speech recognition models that utilizes YouTube as a data source, providing access to an extensive and diverse collection of audio samples for evaluation purposes. This approach ensures that the performance assessment remains up-to-date and relevant, capturing the nuances of real-world speech more accurately than curated datasets. To the best of our knowledge, there is little or no research on using YouTube and video subtitles provided by YouTube users for speech recognition evaluation. Considering all the above, our goal is to answer the following research question:

  1. (RQ)

    Will evaluation of the selected speech recognition machine learning model using YouTube as a data source, as made possible by Mi-Go, produce similar results (measured using the same metric) as the evaluation conducted by the model creators?

Mi-Go automates the process of data extraction, annotation, and evaluation from YouTube, ensuring an up-to-date and representative sample for evaluation purposes. By leveraging algorithms for data filtering and annotation, Mi-Go facilitates a thorough and unbiased evaluation of speech recognition models. Moreover, Mi-Go is designed to be easily adaptable, allowing for seamless integration with a variety of speech recognition solutions, making it a versatile and valuable tool for the speech recognition research community.

The primary motivation behind the development of the Mi-Go tool stems from the recognition of several limitations in existing approaches to evaluate speech recognition models. As speech recognition technology continues to play a critical role in various applications, including voice assistants, transcription services, and accessibility tools, ensuring the robustness and accuracy of these models is crucial.

Other speech recognition model evaluation methods often rely on static, curated datasets which, while useful for establishing a controlled environment, may not fully represent the diversity and complexity of real-world speech scenarios. This can lead to overfitting and limit the model’s generalizability, ultimately affecting its performance in real-world applications.

Additionally, as the field of speech recognition rapidly advances, existing evaluation methods may struggle to keep pace with new developments and challenges, potentially hindering the progress of these models. By utilizing YouTube as a data source, Mi-Go aims to overcome these limitations and offers a more comprehensive and dynamic evaluation environment.

Another motivation for the development of Mi-Go is the need for a flexible and adaptable tool capable of accommodating a variety of speech recognition models. This adaptability allows researchers and developers to compare and contrast the performance of various models, facilitating the continuous improvement and refinement of speech recognition systems.

By addressing these limitations and providing a dynamic, diverse, and adaptable evaluation tool, Mi-Go aspires to contribute significantly to the field of speech recognition research, driving innovation and fostering the development of highly accurate and robust models for various applications.

In summary, the Mi-Go tool is a contribution to the scientific and speech recognition community for the following reasons:

  • Rich and diverse test data source. Mi-Go leverages YouTube, a platform with vast and continuously updated content, to provide a rich source of diverse audio-visual content. This includes various languages, accents, dialects, speaking styles, and audio quality levels. Such diversity is ideal for evaluating the adaptability and performance of speech recognition models in real-world situations, ensuring robustness, accuracy, and adaptability to diverse languages and acoustic conditions.

  • Dynamic evaluation environment. By using YouTube as a data source, Mi-Go addresses limitations of previous approaches that often relied on static and potentially outdated datasets. It offers a more comprehensive and dynamic evaluation environment that reflects current real-world scenarios. This adaptability allows for the comparison of various models and facilitates the continuous improvement and refinement of speech recognition systems.

  • Practical and theoretical contributions. The experimental results obtained through Mi-Go highlight the utility of YouTube as a valuable data source for the evaluation of speech recognition models. This not only underscores the platform’s potential in enhancing model robustness and adaptability but also contributes to the academic discourse by providing a novel methodology for speech recognition research. Additionally, Mi-Go’s approach to contrasting machine-generated transcriptions against human-made subtitles offers insights into potential misuse of subtitles, such as for search engine optimization purposes, thereby adding a layer of practical utility in detecting transcription anomalies.

2 YouTube as a data source for speech recognition model evaluation

With over 2 billion monthly active users and a diverse array of content uploaded every day, YouTube offers a rich resource for researchers and developers working on speech recognition technology. By tapping into this wealth of multilingual and multi-genre content, it is possible to evaluate and refine speech recognition models across various languages, dialects, and acoustic environments.

A vast digital archive. YouTube stands as a colossal repository of digital content, presenting an unparalleled resource for research across various disciplines. As the world’s largest video sharing platform, it hosts billions of videos, a number that continues to grow rapidly, with about 500 hours of new content uploaded every minute. The exact number of hosted videos is not known but is estimated at no fewer than 2.5 billion [3]. The number of YouTube “Shorts” videos alone, identified through the usage of the hashtag #shorts, reached approximately 828 million in February 2024 (Footnote 1).

Diversity of content. YouTube’s vast library of user-generated content covers an extensive range of topics, languages, and styles. This diversity enables the evaluation of speech recognition models in real-world scenarios, such as noisy environments, various accents, and even low-quality audio recordings. By evaluating models on such a diverse dataset, researchers can identify potential weaknesses and areas for improvement, ultimately resulting in more robust and accurate speech recognition systems.

Multilingual corpus. One of the key advantages of using YouTube for speech recognition model evaluation is the platform’s multilingual nature. Videos on the site are available in numerous languages, allowing for the assessment of models’ performance across different linguistic settings. This multilingual corpus is invaluable for developing models that can handle a variety of languages, accents, and dialects, thereby expanding their utility and applicability.

Availability of human-generated transcripts. Many YouTube videos come with human-generated subtitles, either provided by content creators or contributed by users through the platform’s community contributions feature. These transcripts serve as valuable ground-truth data for evaluating speech recognition models, as they offer a reliable source of comparison for the models’ output. By comparing model-generated transcriptions with human-generated ones, researchers can assess the accuracy and performance of their models, identifying areas where improvements are needed.

Potential for continuous model improvement. The ever-growing volume of content on YouTube presents an opportunity for continuous improvement and adaptation of speech recognition models. As new videos are uploaded, models can be re-evaluated and fine-tuned to ensure they remain up-to-date and effective in an ever-changing linguistic landscape. This continuous feedback loop helps researchers identify trends, challenges, and emerging language patterns, which can be incorporated into model updates.

YouTube is an invaluable platform for speech recognition model evaluation due to its diverse, multilingual content and the availability of human-generated transcripts. By leveraging this vast resource, researchers and developers can evaluate and refine their models, ensuring they are robust, accurate, and adaptable to a variety of languages and acoustic conditions.

3 Related work

Studies leveraging YouTube in the area of automatic speech recognition have made significant strides across various facets of the field. These investigations utilize YouTube’s extensive library of videos to create datasets, improve speech recognition systems, and explore new approaches to automatic speech recognition, showcasing the platform’s value in advancing speech recognition technology research. Key insights from these works include:

  • Datasets for automatic speech recognition model creation. Researchers have developed methodologies for creating databases for audio/visual speech recognition using YouTube videos, such as the comprehensive Spanish dataset by Córdova Esparza et al. [7]. In their work, the researchers presented a novel approach for creating an audio/visual speech recognition database, particularly addressing the scarcity of datasets in languages other than English, with a focus on Spanish. By selecting hundreds of YouTube videos, the researchers were able to extract facial features and align voice with text with millisecond accuracy, creating a dataset of over 100,000 samples. That methodology not only facilitated the development of automatic speech recognition systems in underrepresented languages but also provided a blueprint for creating datasets in any language by selecting appropriate YouTube content. Takamichi et al. [29] contributed to the diversification of automatic speech recognition research resources through the JTubeSpeech corpus, which consists of Japanese speech collected from YouTube. This corpus was designed for both speech recognition and speaker verification tasks, addressing the need for comprehensive datasets in Japanese for training and evaluating automatic speech recognition systems. The corpus’s creation from YouTube videos ensured a variety of speech contexts and speaker demographics, enhancing the robustness of automatic speech recognition models trained on it. Lakomkin et al. [20] developed the KT-speech-crawler, an automated tool for constructing speech recognition datasets from YouTube videos. This tool leveraged automatic captioning provided by YouTube to generate datasets, significantly reducing the manual effort required in dataset creation and enabling researchers to easily compile large-scale datasets tailored to specific speech recognition research needs. The latest work in the field, the creation of Yodas, a YouTube-derived dataset, by Li et al. [22], showcases the ongoing efforts to harness YouTube content as a diverse and comprehensive training data resource for developing new, robust speech recognition models. By compiling a diverse set of audio and speech samples from YouTube, Yodas aims to provide a versatile dataset that supports a wide range of automatic speech recognition tasks, including dialect and accent recognition, speech-to-text conversion, and speaker verification.

  • Improvement of automatic speech recognition systems. Liao et al. [23], from Google, explored the use of large-scale deep neural network acoustic modeling for YouTube video transcription. By leveraging the massive amount of unlabeled audiovisual content on YouTube, together with video transcripts uploaded by YouTube users, the researchers were able to enhance the modeling process, demonstrating the potential of semi-supervised learning approaches in improving the performance of automatic speech recognition systems, especially in noisy and challenging acoustic environments. Their findings were then used in actual improvements to YouTube’s automatic speech transcription.

  • Audio-visual speech recognition. In their work, Serdyuk et al. [28] delved into the enhancement of automatic speech recognition by incorporating video content from YouTube, a novel approach that significantly improved speech recognition accuracy. That study leveraged a large corpus of YouTube videos to train models, focusing on how the visual modality, particularly the movement of the speaker’s mouth, could augment audio features for speech recognition tasks. By replacing traditional 3D convolutional neural networks with a video transformer to extract visual features, Serdyuk and his team demonstrated a substantial improvement in word error rates on both a labeled subset of YouTube videos and the LRS3-TED public corpus (described in [2]). Their methodology highlighted the potential of utilizing video content alongside audio data to advance the capabilities of automatic speech recognition systems. This research not only showcased the importance of YouTube as a rich data source for speech recognition technologies but also opened new pathways for enhancing speech recognition accuracy by integrating audio-visual data, paving the way for more sophisticated and efficient automatic speech recognition systems.

  • Bias and inclusivity in automatic speech recognition. Koenecke et al. [18] uncovered significant racial disparities in the performance of commercial automatic speech recognition systems, including those developed by major tech companies. By analyzing speech from white and African American speakers, the study revealed a higher word error rate for African American speakers, highlighting a critical area for improvement in making automatic speech recognition technologies more inclusive and equitable. Tatman and Kasten [30] investigated the effects of talker dialect, gender, and race on the accuracy of Bing Speech and YouTube automatic captions. Their findings emphasized the impact of sociolinguistic factors on automatic speech recognition accuracy, urging the development of more sophisticated models that could better accommodate the diversity of human speech.

  • Utilizing YouTube as an automatic speech recognition tool. Kim et al. [17] embarked on an insightful exploration into the capabilities of automatic speech recognition tools by utilizing YouTube’s automatic transcription service as a benchmark for automatic speech recognition accuracy. In their study, they meticulously compared manual transcriptions with those generated automatically by YouTube, alongside other leading speech recognition platforms such as Google Cloud, IBM Watson, Microsoft Azure, and Trint. Their analysis provided a comprehensive evaluation of the relative performance of these services, with a particular focus on YouTube’s efficacy in providing accurate transcriptions. This approach not only highlighted YouTube’s potential as an accessible and effective tool for automatic speech recognition but also contributed to the broader discourse on the reliability and accuracy of free, platform-based speech recognition services. Through their comparative study, Kim et al. shed light on the strengths and limitations of YouTube’s transcription capabilities, offering valuable insights for researchers, developers, and users seeking to leverage automatic speech recognition technology in various contexts.

These studies illustrate the extensive use of YouTube as a rich data source for automatic speech recognition research, ranging from training dataset creation to addressing biases and inclusivity in speech technologies. However, to the best of our knowledge, there is no work describing the direct use of YouTube to evaluate the functional performance of the existing machine learning models used for automatic speech recognition.

4 Mi-Go Tool

Mi-Go is written in the Python programming language. Its source code is available under the Apache-2.0 license at the following address: https://github.com/Kowalski1024/Mi-Go

In the following, we describe the tool by walking through its successive operations, from launching it to saving the evaluation results of the selected speech recognition model.

4.1 Test Plan preparation

To start working with the tool, we need a file in JSON format, called a Test Plan. This is illustrated as number 1 in Fig. 1. In special circumstances, the Test Plan file can be written manually, but it is more efficient to generate it using an additional script named the Test Plan Generator. This script queries YouTube’s API to compile a random list of videos, based on command-line parameters specifying the category of the videos, language, duration, and desired number of list items (details can be found in Appendix 1). Only videos for which YouTube clearly indicates the presence of human-made subtitles are considered. To query the API, the Test Plan Generator uses an external Python library called youtube-transcript-api (Footnote 2). After querying the API, the Test Plan file contains all the necessary metadata about the videos used in further evaluation; it also stores information about the selected parameters and a token for the YouTube Data API, which can be reused in subsequent test iterations if needed.
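To make the idea concrete, the following is a minimal sketch of what such a Test Plan might look like; the field names here are purely illustrative assumptions, not the actual schema used by Mi-Go:

```python
import json

# Hypothetical Test Plan structure; the real Mi-Go file format may differ.
test_plan = {
    "args": {                      # parameters passed to the Test Plan Generator
        "category": "Education",   # YouTube video category
        "language": "en",          # required subtitle language
        "maxDuration": 600,        # maximum video duration in seconds
        "maxResults": 10,          # desired number of videos
    },
    "nextPageToken": "CAoQAA",     # YouTube Data API token, reusable in later iterations
    "videos": [
        {
            "videoId": "dQw4w9WgXcQ",
            "title": "Example video",
            "duration": 212,
            "hasManualSubtitles": True,   # only such videos are considered
        }
    ],
}

print(json.dumps(test_plan, indent=2))
```

A generated file of this shape carries everything a later test iteration needs: the query parameters, the API token, and per-video metadata.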

Fig. 1

Mi-Go and speech recognition model evaluation phases (described in text)

4.2 Data extraction and transcription

In the next step, marked with number 2 in Fig. 1, Mi-Go reads the Test Plan and, based on it, downloads from YouTube the audio track and the subtitles of each video listed in the plan. Thus, for each video, we obtain a pair consisting of an audio file (2a) and human-generated subtitles (2b).

In the next step (number 3 in Fig. 1), a speech recognition model is employed to convert the downloaded audio into a textual transcript. This is done by the TranscriptTest component, which executes the speech recognition machine learning model against the audio data collected from YouTube. The component can be adapted to a specific speech recognition model by extending it with model-specific code. This allows the use of different models from the popular “Hugging Face” machine learning model repository (Footnote 3) as well as models dedicated to toolkits such as ESPnet or NeMo.
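The extension mechanism can be sketched as a small class hierarchy; note that the class and method names below are illustrative assumptions about the interface rather than Mi-Go’s actual code, and the Whisper wrapper assumes the openai-whisper package is installed:

```python
from abc import ABC, abstractmethod

class TranscriptTest(ABC):
    """Sketch of a base component that runs a speech recognition model
    on downloaded audio. Names here are hypothetical."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the model-generated transcript for one audio file."""

class WhisperTranscriptTest(TranscriptTest):
    """Model-specific subclass, e.g. wrapping the openai-whisper package."""

    def __init__(self, model_name: str = "base.en"):
        import whisper                       # deferred import: heavy dependency
        self.model = whisper.load_model(model_name)

    def transcribe(self, audio_path: str) -> str:
        return self.model.transcribe(audio_path)["text"]
```

An analogous subclass could wrap a Hugging Face pipeline or an ESPnet/NeMo model, keeping the rest of the evaluation loop unchanged.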

To eliminate inconsequential textual differences, both the subtitles downloaded from YouTube (number 2b in Fig. 1) and those generated by the speech recognition model (4) undergo a normalization process (5a and 5b) using OpenAI’s normalization function (Footnote 4).
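The following is a rough approximation of what such normalization does; OpenAI’s actual normalizer is considerably more elaborate (it also expands contractions, spells out numbers, and so on), so this sketch only conveys the idea:

```python
import re

def normalize(text: str) -> str:
    """Crude transcript normalization: lowercase, drop bracketed
    annotations such as [Music], strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)  # remove [music], (applause), ...
    text = re.sub(r"[^\w\s']", " ", text)              # strip punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

print(normalize("Hello, World! [Music] It's nice."))   # → hello world it's nice
```

Applying the same normalization to both transcripts ensures that WER reflects recognition errors rather than differences in casing or punctuation conventions.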

4.3 Evaluation and metrics

Speech recognition model evaluation involves comparing the human-made subtitles downloaded from YouTube with those generated by the model (number 6 in Fig. 1). For that evaluation, the Mi-Go tool uses the open-source JiWER library (Footnote 5) to calculate the Word Error Rate (WER) measure [27]. WER is a common metric used to assess the performance of speech recognition systems, automatic translation systems, and other tasks involving transcription or translation. It is calculated by determining the minimum number of operations needed to transform the system output into the correct output. These operations are (see Eq. 1): word insertions I, word deletions D, and word substitutions S. To compute the WER, the total number of these operations is divided by the total number of words in the correct output N (in our case: the total number of words in the subtitles attached to a particular YouTube video), yielding a ratio that represents the rate of errors per word. The lower the WER, the better the performance of the system, as it means fewer errors were made.

$$\begin{aligned} \text{ WER } = \frac{S + D + I}{N} \cdot 100\% \end{aligned}$$
(1)

The concept of WER has been part of the field of automatic speech recognition and computational linguistics for many years. It is based on the Levenshtein distance or edit distance, a string metric for measuring the difference between two sequences, introduced by Vladimir Levenshtein in 1965 [21]. The exact individual or group that first applied this concept specifically as Word Error Rate in speech recognition or translation systems is not clearly documented. It likely emerged from the academic and industry communities working on speech and language processing technologies. WER has since become a standard measure in these fields. In some cases, WER is expressed as a percentage (by multiplying the original formula by 100%), especially when easy understanding of the measure is a main concern.
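Since WER is the word-level Levenshtein distance divided by the reference length, Eq. 1 can be computed with a standard dynamic-programming table. The minimal implementation below is for illustration only; in Mi-Go this computation is delegated to the JiWER library:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N * 100%, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i hypothesis words
    # into the first j reference words
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1    # substitution S
            dp[i][j] = min(dp[i - 1][j] + 1,               # extra hypothesis word (insertion I)
                           dp[i][j - 1] + 1,               # missing reference word (deletion D)
                           dp[i - 1][j - 1] + cost)
    return dp[len(hyp)][len(ref)] / len(ref) * 100         # percentage, as in Eq. 1

# One reference word missing from the hypothesis: 1/6 errors ≈ 16.67%
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions relative to a short reference, which is exactly what happens in the extreme outlier reported in Section 6.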

The comparison results are stored both in the SQLite database (7b in Fig. 1) and directly in the previously used Test Plan file (7a). Such a Test Plan file, with its evaluation results recorded, can be reused in subsequent evaluation iterations, for instance, to augment results not previously obtained or to retest the same videos specified within it. This dual-storage approach (database and Test Plan file) facilitates simple access, filtering, and analysis of the evaluation results.
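The dual-storage idea can be sketched as follows; the table schema and field names are illustrative assumptions, not Mi-Go’s actual database layout:

```python
import json
import sqlite3

def store_result(db: sqlite3.Connection, plan: dict, video_id: str, wer: float) -> None:
    """Record one evaluation result in both the database and the Test Plan dict."""
    db.execute("INSERT INTO results (video_id, wer) VALUES (?, ?)", (video_id, wer))
    db.commit()
    for video in plan["videos"]:          # mirror the result into the Test Plan
        if video["videoId"] == video_id:
            video["wer"] = wer

db = sqlite3.connect(":memory:")          # hypothetical schema for illustration
db.execute("CREATE TABLE results (video_id TEXT, wer REAL)")
plan = {"videos": [{"videoId": "abc123"}]}
store_result(db, plan, "abc123", 7.4)
print(json.dumps(plan))                   # the plan now carries the evaluation result
```

Writing the result back into the plan is what makes a Test Plan file reusable across iterations, while the database side supports filtering and aggregate queries.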

5 Experimental setup

Here, we describe an experimental setup that leverages the Mi-Go tool to use YouTube videos, across all categories, as data to evaluate speech recognition models by comparing their output with human-made transcripts. The purpose of the experiment is to confirm whether this setup (Mi-Go with YouTube as the evaluation data source) allows evaluating the speech recognition models and obtaining results similar to those reported by the model creators.

5.1 Machine learning models used in the experiment

5.1.1 OpenAI’s Whisper

OpenAI, a company most notably recognized for its contribution to the field of artificial intelligence through the development of advanced large language models like GPT-3 and GPT-4, also developed Whisper, a family of state-of-the-art, general-purpose speech recognition models that demonstrate exceptional performance in various applications [27].

Due to the proven outstanding performance of that model family, as well as the fact that it has been made available under the open-source MIT License, we decided that, in our experiment, we would mainly focus on the evaluation of the Whisper models. At this point, we should explain that the name “Mi-Go” comes from a novella by H.P. Lovecraft called “The Whisperer in Darkness”; thus, in our opinion, it makes a fitting name for a tool initially created to evaluate the Whisper models.

The model is based on a Transformer sequence-to-sequence architecture and is trained on a range of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are collectively represented as a sequence of tokens to be predicted by the decoder, enabling a single model to supplant multiple stages of a conventional speech processing pipeline. The multitask training approach employs a series of unique tokens that act as task specifiers or classification targets [27].

The Whisper model is available in five sizes. Four of them (tiny, base, small, medium) have additional English-only versions which, according to the creators, perform better when used in English-only applications [16]. Thus, in our research, we decided to use the English-only model versions. The “large” model was improved twice; thus, in our experiment, we used two versions of the “large” model: the initial version, marked as “Whisper large-v1,” and the latest version, marked as “Whisper large-v3.” Each model offers a different balance between speed and accuracy. The names of the models used, their approximate memory requirements, and relative speeds are provided in Table 1.

Table 1 Comparison of Whisper models [16]

5.1.2 NVIDIA’s Conformer-Transducer X-Large

To demonstrate that Mi-Go can be used for the evaluation of different speech recognition models, apart from OpenAI’s Whisper, our experiment also included models provided by other companies, such as one developed by NVIDIA, built upon the Conformer-Transducer architecture, which blends the strengths of transformer and convolutional neural network architectures [13]. The “X-Large” variant of this model signifies its substantial size and capacity, enabling it to process and understand complex audio inputs with higher accuracy compared to its predecessors. It is distributed under the Creative Commons BY 4.0 license [24].

When comparing the Conformer-Transducer X-Large model to OpenAI’s Whisper model, there are several key points of differentiation. The Whisper model, as we stated before, is based on a different architectural approach, primarily leveraging transformer neural networks. While both models aim to provide high accuracy in speech-to-text conversion, the NVIDIA model’s use of the Conformer-Transducer architecture may offer advantages in handling real-time or streaming audio applications. Additionally, the specific design choices in the NVIDIA model might result in better performance in certain scenarios, such as dealing with background noise or low-quality audio inputs [8].

The Conformer-Transducer X-Large model is primarily used by NVIDIA in their open-source NeMo toolkit, which is designed to simplify the process of building, training, and fine-tuning complex neural network models, particularly for speech and natural language processing tasks [19]. To reflect this fact, as well as to use a shorter name, in the following text we refer to the model as “NeMo Transducer Xlarge.”

5.1.3 ESPnet2 model

Similarly to NeMo, ESPnet2 (End-to-End Speech Processing Toolkit, version 2) is an open-source (using Apache 2.0 license) software toolkit designed for speech processing tasks, including automatic speech recognition, text-to-speech, and language modeling. Key features of ESPnet2 include its support for state-of-the-art machine learning models, its flexibility in handling different types of neural network architectures, and its comprehensive set of tools for training, evaluating, and fine-tuning models. ESPnet2 is widely used in the academic and research community for experimenting with novel ideas in speech processing and for developing systems that are more efficient and accurate in real-world applications [15].

Among the speech recognition models available for the ESPnet2 toolkit, we chose one of the models trained by Shinji Watanabe, referred to in this work as “ESPnet2 Conformer” (Footnote 6), as a reference point for the Whisper model evaluation in our experiment. The selection of this particular model was motivated by the fact that it was used successfully in official ESPnet2 demonstration material [31].

5.1.4 Facebook’s wav2vec2-base-960h

Facebook’s Wav2Vec 2.0 is an advanced neural network-based framework for speech recognition developed by Facebook AI researchers. It employs a self-supervised learning approach in which the model is initially trained on 53,000 hours of unlabeled audio [5]. This pre-training allows the model to learn representations of speech from the raw audio itself. Once pre-trained, Wav2Vec 2.0-derived models can be fine-tuned with a smaller amount of labeled data to achieve high performance in transcribing speech. The model selected for our experiment, “wav2vec2-base-960h,” was fine-tuned on 960 hours of 16 kHz sampled speech audio from the LibriSpeech dataset [25].

5.2 Data collection and preparation

To begin the experiment, we instruct the Test Plan Generator, a component of the Mi-Go tool, via the command-line interface, to randomly fetch 7–10 videos per category listed in Table 2. Importantly, we decided on this number of videos based solely on the available computing resources; the number of videos used for evaluation is not restricted and can be freely set by other Mi-Go users.

Table 2 YouTube videos categories considered in the experiment

These videos are randomly selected, but filtered based on factors such as popularity, relevance, and the presence of human-generated subtitles, ensuring a diverse and high-quality dataset. The YouTube Data API is used to acquire the videos, while the youtube-transcript-api library retrieves their corresponding transcripts. Once fetched, the same set of videos is used to evaluate the selected automatic speech recognition models (presented in Section 5.1). The full list of 141 videos used in the experiment is provided in Appendix 4.

6 Results

To answer the research question, we used the proposed Mi-Go tool with 141 YouTube videos, representing all categories listed in Table 2, to evaluate the selected automatic speech recognition models (presented in Section 5.1) and to collect Word Error Rate (WER) metrics as a result.

Statistics for collected Word Error Rate values for all evaluated models are presented in Table 3 and illustrated in Fig. 2. Detailed statistics of the WER value for each model by category are presented in Appendix 3. Results for different datasets compared to our YouTube-based results are gathered in Appendix 2.

Table 3 Word Error Rate [%] value statistics for all evaluated model versions
Fig. 2

Box plot of experiment results. Note the logarithmic scale

The Whisper model characteristics published by its authors [27] concern only the “large-v1” model; thus, in Table 3, we present the WER statistics for that model in bold font.

As we can see, the median of the “large-v1” model evaluation results is WER = 7.4%. The worst median reported by the Whisper creators for the “large-v1” model was 19.6% (see Table 4 in Appendix 2); that result was obtained on the CORAAL speech recording dataset popularized by Gunter et al. [14]. The other datasets used by the creators to validate the models were recordings of earnings calls by Del Rio et al. [9], sets of recordings of online blogs and podcasts, and a dataset containing recordings of The Late Show (sic!). The Whisper “large-v1” evaluation results from [27], compared with our results, are presented in Table 4 in Appendix 2. From this comparison, we can conclude that the Whisper model evaluation described in this work produces results similar to the tests conducted by the Whisper model creators on different data. Similarly, our results for the ESPnet2 Conformer and wav2vec models are close to those of other authors, achieved using different datasets (Tables 5 and 7 in Appendix 2). The low WER median of the YouTube-based results for the Conformer-Transducer model compared with the results of other authors (Table 6 in Appendix 2) can be explained by the occurrence of the highest WER value for this model (18,250%), caused by the model refusing to transcribe the music video “All I Want For Christmas Is You” by Mariah Carey (the other models handled it), possibly because of a model failureFootnote 7.

The newer Whisper model version, “large-v3”, yielded a worse WER median than “large-v1”. At the same time, “large-v3” produced a much lower maximum WER value and standard deviation than “large-v1”. We interpret this result as an indication that the outcomes of “large-v3” are more stable than those of the older Whisper version.
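This stability reading rests on comparing medians against maxima and standard deviations over the per-video WER lists, which can be sketched with the Python statistics module. The numbers below are made up for illustration, not the measured values from Table 3:

```python
import statistics

def wer_summary(wers: list[float]) -> dict:
    """Summary statistics used to compare model versions: a lower maximum
    and standard deviation at a similar median indicates more stable
    outcomes across videos."""
    return {
        "median": statistics.median(wers),
        "max": max(wers),
        "stdev": statistics.stdev(wers),
    }

# Hypothetical WER lists [%]: one version with an extreme outlier,
# one with a slightly worse median but a far lower maximum.
older = [5.0, 6.5, 7.4, 9.0, 300.0]
newer = [6.0, 7.5, 8.1, 9.5, 40.0]
print(wer_summary(older))
print(wer_summary(newer))
```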

Some results show large WER values that differ significantly from the median. However, a review of the YouTube videos behind the tests that ended with high WER values shows that the cause is not a malfunction of the Mi-Go tool or of the speech recognition models. Instead, the high WER values reflect actual discrepancies between the human-made subtitles attached to the videos and the transcriptions generated by the models. We found that such discrepancies occur for several reasons:

  1.

    Transcription errors. Humans, despite their proficiency, are not infallible and may make mistakes when transcribing speech to text, for instance by mishearing words or phrases in a noisy environment, during rapid speech, or when dealing with dialectal variations or accents. Automatic speech recognition models, in turn, can “hallucinate” under certain conditions, causing high WER values. For example, in our experiment, for one video containing little speechFootnote 8, the Whisper “large-v1” model returned the following transcription:

    I’m not a dog. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. I’m a cat. (...)

  2.

    Interpretation differences. Subtitling is not always a direct one-to-one transcription process. The transcriber’s understanding and interpretation of the speech can influence the outcome. Homonyms, idiomatic expressions, cultural references, or ambiguous statements can all be interpreted differently depending on the transcriber’s knowledge and perspective.

  3.

    Contextual adaptations. Subtitle makers often make deliberate changes to the text for various reasons. They may simplify or clarify speech to make it more accessible to the audience, especially if the speech is complex or jargon-filled. They may also modify the text to match reading speed constraints, given that text must be readable within the time it is displayed. Cultural adaptations may also be made to make the content more comprehensible to a specific audience (as a form of video localization).

  4.

    Descriptive transcriptions. Some transcriptions go beyond the spoken content and provide descriptions of the visual elements in the video. These are often intended for visually impaired or blind viewers, to give them a more comprehensive understanding of the video content. Such a case occurred with the video that produced the second-highest WER value in our experiment (WER = 12,650%). While that video consists only of animal sounds, the actual subtitles are as follows (original spelling)Footnote 9:

    Cats Cats are very cute animals Animals that are close and affectionate with people Cat breed is a species with relatively high fertility, giving birth to 2-3 litters of kittens a year New born kittens only weighs about 100g and fits easily in the palm of your hand Horses are smart, wise animals Mother horses as young as 3 years old can start breeding (...)

  5.

    Search engine optimization (SEO). Some subtitles may be created or modified with the goal of improving the video’s visibility in search engine results. The inclusion of relevant keywords and phrases can make the video more likely to appear in search results related to those terms, hence enhancing the video’s discoverability. Here is an example of such subtitles from one of the fetched videosFootnote 10:

    The Animals, Funniest Animals Video, Funny Video, Funny Animals, Cats, Dogs, Funny Cats, Funny Dogs, Pets, Funny Pets, Funny, Cute, Cute Animals, Cute Pets, Funny Cat Video, Funny Dog Video, Funny Animals Life, Wow, Best Animals, Best Animals Video, Compilation, Funny Video Compilation, Kittens, Puppies, Try not to laugh, Best Animals 2023, Best of 2022, Cute Puppy, Funny Kitten, Animals International, Funny Animal Video.

By comparing a model-made transcription to the existing human-made subtitles, discrepancies can be identified. Factors such as background noise, speaker accents, or low-quality audio can impact the model’s performance. Hence, although speech recognition models can help identify potential inaccuracies in subtitles, a degree of human oversight and validation is typically necessary to confirm and rectify them. From a different perspective, an automated setup that combines Mi-Go with a selected speech recognition model can significantly help detect misuse of video subtitles.
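Repetition hallucinations like the “I’m a cat” loop above are easy to pre-flag automatically before human review. A simple heuristic of our own (not a Mi-Go feature): measure the fraction of a transcript’s word trigrams that occur more than once.

```python
from collections import Counter

def repeated_trigram_fraction(text: str) -> float:
    """Fraction of word trigrams occurring more than once; values near
    1.0 suggest a looping (likely hallucinated) transcript."""
    words = text.lower().split()
    if len(words) < 3:
        return 0.0
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(trigrams)

looping = "I'm a cat. " * 30
varied = "cats are very cute animals that are close and affectionate with people"
print(repeated_trigram_fraction(looping))  # 1.0, every trigram recurs
print(repeated_trigram_fraction(varied))   # 0.0, no trigram recurs
```

Transcripts scoring above a chosen threshold could be routed straight to manual review instead of being counted as ordinary evaluation failures.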

7 Conclusions and future work

In this paper, we have introduced Mi-Go, a lightweight and flexible tool for evaluating general-purpose speech recognition models using YouTube’s vast and diverse content. Traditional evaluation methods, which employ curated datasets, may not capture the broad array of real-world scenarios, potentially limiting a model’s generalizability. Mi-Go, by leveraging YouTube’s dynamic content, offers an enriched platform for evaluating such models. An experiment conducted on 141 randomly fetched YouTube videos demonstrated the usefulness of the Mi-Go tool in evaluating model prediction performance and in identifying discrepancies between model-generated transcriptions and human-made subtitles. The results underscore the necessity of human oversight in rectifying inaccuracies and the potential of the Mi-Go tool for enhancing the robustness and adaptability of speech recognition models.

While the Mi-Go tool demonstrates promising results in evaluating speech recognition models, several avenues for future work can further enhance its capabilities:

  1.

    Expanding the tool to accommodate other data sources (like non-English YouTube videos or video hosting services other than YouTube), providing an even more diverse and representative set of audio samples for evaluation

  2.

    Incorporating advanced techniques for data preprocessing and augmentation, which can help in simulating various real-world challenges, such as background noise and audio distortions

  3.

    Developing a graphical user interface and API, making it easier for researchers and developers to integrate and utilize the Mi-Go tool in their projects

  4.

    Extending the tool to support other tasks, such as speaker identification evaluation and language identification evaluation, in addition to automatic speech recognition evaluation

An important area for further work is the tool’s inability to handle audio characteristics such as noise, the number of speakers, accents, and the distance of the speaker. This limitation stems from the tool’s foundational approach: a straightforward comparison between human-made YouTube subtitles and those generated by a speech recognition model. This approach inherently focuses on textual alignment without delving into the nuances of audio quality or speaker attributes.

To address the handling of the audio characteristics mentioned above, an advanced feature could be integrated into the Mi-Go tool, employing audio analysis techniques to evaluate and adjust for different audio characteristics before the transcription process. This enhancement could involve pre-processing algorithms capable of detecting and compensating for noise levels, identifying speaker count and accents, and adjusting for recording distance. Such improvements would not aim to refine the accuracy of the speech recognition itself, as that is not the tool’s purpose, but would enrich Mi-Go’s evaluation results by attaching possible root causes (such as high levels of noise or far-field speech) to potentially poor model performance.
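As one concrete possibility for such pre-processing, a crude energy-based signal-to-noise probe could tag each video before evaluation. The sketch below is our own illustration under simplifying assumptions (fixed-size frames, quietest frames treated as noise), not an implemented Mi-Go feature:

```python
import math
from statistics import pvariance

def crude_snr_db(samples: list[float], frame: int = 400) -> float:
    """Rough SNR estimate: treat the quietest tenth of fixed-size frames
    as noise and compare its mean energy to that of the loudest tenth."""
    frames = [samples[i:i + frame]
              for i in range(0, len(samples) - frame + 1, frame)]
    energies = sorted(pvariance(f) for f in frames)
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / k
    speech = sum(energies[-k:]) / k
    return 10 * math.log10(speech / max(noise, 1e-12))

# Synthetic signal: a near-silent half followed by a loud tone.
samples = [0.01 * math.sin(0.3 * i) for i in range(2000)] + \
          [math.sin(0.3 * i) for i in range(2000)]
print(round(crude_snr_db(samples)))  # roughly 40 dB
```

A tag like “SNR below 10 dB” attached to an evaluation result would let users separate model weaknesses from genuinely hard audio.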

The Mi-Go tool is currently undergoing a rigorous and comprehensive testing process that follows established software quality assurance practices [10]. This testing is crucial not only to ensure the tool’s reliability and accuracy in evaluating speech-to-text models but also to guarantee a good user experience, free from technical glitches and usability hurdles. By subjecting Mi-Go to such thorough scrutiny, we aim to provide users with a seamless and efficient tool for speech-to-text system evaluation.

We hope that the Mi-Go tool will find wide application both in speech recognition machine learning model evaluation and in the detection of anomalies in existing video transcriptions.