Custom Collection of Scripted Utterance Speech Dataset

A leading company in speech recognition and natural language processing technology asked us to collect a large-scale monologue speech dataset to improve the performance of its speech recognition systems. The company wanted a highly accurate and diverse dataset in German and Spanish to train its AI systems.

Overview

This case study explores the challenges faced by a client who needed a custom speech dataset collected from 1,000 native speakers with diverse accents, dialects, genders, and age profiles. The client's requirements centered on recordings of scripted sentences. The sections below describe the strategies and solutions our team used to meet those requirements and overcome the obstacles encountered during collection.

Client's Method of Data Collection

The client had created its own process for collecting speech data with a small group of people. It followed these steps:

●The client sends an Excel file containing 100 sentences to a participant.

●The participant opens a specific open-source sound recorder and records the sentences from the Excel file one at a time.

●The participant downloads each recording and renames it.

●For each recording, the participant creates a text file containing the corresponding sentence and renames that file as well.

●After finishing all recordings and text files, the participant creates one more text file containing their metadata, such as age, gender, language, and country.

The Challenge

The client's first challenge was that it had no efficient process to find, onboard, and manage such a large crowd. In addition, it wanted the entire dataset collected in around 25 days and delivered in a proper format along with metadata.

Client's Data Collection Scenario

Through a comprehensive analysis of the current process, it has been observed that:

●Users are taking 1-1.5 minutes to record an 8-12 second prompt due to redundant tasks such as downloading, renaming, and organizing.

●Participants are demanding higher compensation than the allocated budget due to the aforementioned manual tasks.

●The manual and repetitive nature of these tasks increases the likelihood of errors throughout the process, resulting in more rejections and decreased participant engagement.

●Furthermore, the review process is equally burdensome and time-consuming.

The Solution

[Adapting our proprietary, state-of-the-art tool to collect data and manage the crowd]

To meet this requirement, we used our mobile application, Yugo, which made recording much easier for participants and relieved them of all the redundant tasks. With the app, participants could focus on producing quality recordings while the error-prone tasks were automated.

Yugo is our state-of-the-art mobile application for collecting speech data.
Here's how it works:

●To get started, users can download the app from the Play Store. During sign-up, we collect information like age, gender, country, and dialect, which we later use as metadata.

●We no longer have to send a script to each individual for recording. Now, we can create a project in the admin panel, upload the scripts, and assign them to specific users.

●Once a user is assigned a batch of sentences, they can see their progress and status, including how many sentences they've completed and their QA status. The user will receive one sentence at a time on their screen and can record it, play it back, listen to it, re-do it if necessary, and submit it from the app.

●When a user has finished recording their assigned batch of sentences, we can assign the entire batch for review to QA from the admin panel. The reviewer listens and compares each recording with its corresponding sentence on the screen, one at a time, to ensure that the quality is sufficient. If it's not, the reviewer will reject the recording with a comment, and it will be automatically assigned to the specific user to re-record.

●We can also use the app to provide instructions and sample recordings for recorders and reviewers, ensuring that they know what we expect from them in terms of recording quality. These instructions are available at any time within the app.
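
The record, review, and re-record cycle described above can be pictured as a small state machine. The following is only an illustrative sketch, not Yugo's actual implementation: the `Status` values, the `Task` class, and the `progress` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ASSIGNED = "assigned"    # sentence is waiting to be recorded
    SUBMITTED = "submitted"  # recording uploaded, awaiting review
    APPROVED = "approved"    # reviewer accepted the recording
    REJECTED = "rejected"    # reviewer rejected it; it returns to the recorder


@dataclass
class Task:
    """One sentence assigned to one recorder (hypothetical model)."""
    sentence: str
    recorder_id: str
    status: Status = Status.ASSIGNED
    comments: list = field(default_factory=list)

    def submit(self):
        # A recorder may submit a fresh task or re-record a rejected one.
        if self.status not in (Status.ASSIGNED, Status.REJECTED):
            raise ValueError(f"cannot submit from {self.status.value}")
        self.status = Status.SUBMITTED

    def review(self, ok, comment=""):
        # A reviewer can only judge a recording that has been submitted.
        if self.status is not Status.SUBMITTED:
            raise ValueError(f"cannot review from {self.status.value}")
        if ok:
            self.status = Status.APPROVED
        else:
            self.comments.append(comment)
            self.status = Status.REJECTED


def progress(tasks):
    """Count tasks per status, e.g. for a recorder's progress screen."""
    counts = {s.value: 0 for s in Status}
    for task in tasks:
        counts[task.status.value] += 1
    return counts
```

In this sketch a rejected task keeps the reviewer's comment and can be submitted again by the same recorder, which mirrors the automatic reassignment described above.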

To accomplish our client's goal, we onboarded native German and Spanish participants from our global crowd community for this project. We targeted the following distribution of participants across age groups and genders, split evenly between male and female:

Age group    18-35    36-50    51+
Male         20%      20%      10%
Female       20%      20%      10%
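
To show how these percentages translate into recruitment targets, here is a minimal sketch that converts the shares into head counts for a given number of participants. The `quotas` function and the `SHARES` table layout are assumptions made for this example, and the rounding rule (largest remainder) is our own choice, not something the project specified.

```python
# Target shares in percent, taken from the distribution table.
SHARES = {
    ("male", "18-35"): 20, ("male", "36-50"): 20, ("male", "51+"): 10,
    ("female", "18-35"): 20, ("female", "36-50"): 20, ("female", "51+"): 10,
}


def quotas(total):
    """Split `total` participants across the gender/age cells.

    Integer arithmetic is used throughout; any participants left over
    after rounding down go to the cells with the largest remainders.
    """
    if sum(SHARES.values()) != 100:
        raise ValueError("shares must add up to 100%")
    scaled = {cell: total * pct for cell, pct in SHARES.items()}  # hundredths
    result = {cell: value // 100 for cell, value in scaled.items()}
    leftover = total - sum(result.values())
    by_remainder = sorted(scaled, key=lambda cell: scaled[cell] % 100, reverse=True)
    for cell in by_remainder[:leftover]:
        result[cell] += 1
    return result


if __name__ == "__main__":
    for cell, count in quotas(1000).items():
        print(cell, count)
```

For 1,000 participants this gives 200 people in each of the four younger cells and 100 in each of the two 51+ cells.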

In addition, we collected all recordings with consistent technical specifications: a sample rate of 16 kHz and a bit depth of 16 bits. We delivered all audio files in WAV format, which suits this use case because it stores the audio without lossy compression.
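
A check like the following could confirm that a delivered file matches these specifications. This is a sketch using Python's standard `wave` module, not the validation Yugo actually performs; the function name `check_wav` and the wording of the problem messages are assumptions.

```python
import wave

EXPECTED_RATE_HZ = 16_000        # 16 kHz sample rate
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples are 2 bytes wide


def check_wav(path):
    """Return a list of problems found in the WAV file at `path`.

    An empty list means the file matches the expected sample rate
    and bit depth.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"bit depth is {wav.getsampwidth() * 8} bits")
    return problems
```

Running such a check at upload time, rather than at delivery, would let a non-conforming recording be sent back before a reviewer spends time on it.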

The Results

Delivered assets: 10k+

Time duration: 25 days

Participants: 1k+

Languages: 2

We have successfully accomplished the task of completing the entire collection within a span of 25 days, including the Quality Assurance (QA) process. Our team has effectively constructed a varied and impartial dataset for Automatic Speech Recognition (ASR) training purposes, satisfying the requirements of our client. The implementation of the Yugo mobile application facilitated a significant decrease in the recording duration, from 1-1.5 minutes to 20-25 seconds, for a single sentence, resulting in an expedited and efficient workflow.
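
The per-sentence figures can be turned into a rough per-batch estimate. The sketch below assumes a batch of 100 sentences, which was the batch size in the client's original process; the batch size used in Yugo is not stated, so treat that number as an assumption. The function names are ours.

```python
def batch_minutes(seconds_per_sentence, sentences=100):
    """Total recording time for one batch, in minutes."""
    return seconds_per_sentence * sentences / 60


def reduction(before_s, after_s):
    """Fraction of per-sentence time saved."""
    return 1 - after_s / before_s


# Manual process: 1-1.5 minutes (60-90 s) per sentence.
manual = [batch_minutes(s) for s in (60, 90)]
# In-app process: 20-25 s per sentence.
in_app = [batch_minutes(s) for s in (20, 25)]

if __name__ == "__main__":
    print(f"manual: {manual[0]:.0f}-{manual[1]:.0f} min per batch")
    print(f"in-app: {in_app[0]:.0f}-{in_app[1]:.0f} min per batch")
    print(f"saving: {reduction(60, 25):.0%} to {reduction(90, 20):.0%}")
```

Under these assumptions a 100-sentence batch drops from roughly 100-150 minutes to roughly 33-42 minutes, a per-sentence saving of about 58% to 78% depending on which ends of the two ranges are compared.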

Our team has streamlined the entire process and ensured its accessibility to all users by automating repetitive tasks, resulting in a reduction in overall errors and rejections. By doing so, we have been able to maintain the task's fairness and interest for recorders, while also completing the project within the specified budget constraints.

Our Yugo administration system efficiently provides the client with the entire dataset, comprising all metadata, in a structured format. The final delivery includes comprehensive details such as the audio file link, corresponding text file link, recorder's age, gender, country, state, zip code, recording device information, date of recording and review, and Quality Assurance (QA) status.
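
One way to picture such a structured delivery is a manifest with one row per recording. The sketch below is only illustrative: the actual export format is not described here, and the `DeliveryRow` class, its field names, and the choice of CSV are all assumptions.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DeliveryRow:
    """One recording in the delivery manifest (hypothetical field names)."""
    audio_url: str
    text_url: str
    age: int
    gender: str
    country: str
    state: str
    zip_code: str
    device: str
    recorded_on: str  # date of recording, e.g. in ISO 8601 form
    reviewed_on: str  # date of review
    qa_status: str


def to_csv(rows):
    """Serialize manifest rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(DeliveryRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()
```

Keeping the speaker metadata on the same row as the audio and text links means a consumer can filter the dataset by age, gender, or region without opening any per-recording files.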

Services Used

STATE-OF-THE-ART TOOL

[Yugo mobile application]

Collect any type of speech dataset with our state-of-the-art speech data collection application, Yugo, in a structured format in the shortest time possible.

CUSTOM SPEECH DATA COLLECTION

[Scripted monologue speech dataset]

Collect any type of speech dataset with our global crowd community of 10,000+ people from 50+ countries, in most languages.
