A leading company working in speech recognition and natural language processing technology approached us with the requirement of collecting a large-scale monologue speech dataset to be used for improving the performance of its speech recognition systems. The company wanted to create a highly accurate and diverse dataset in German and Spanish to train its AI systems.
This case study explores the challenges faced by a client who required a custom speech dataset collected from 1,000 native speakers with diverse accents, dialects, genders, and age profiles. The client's specific requirements included the recording of scripted sentences. The case study highlights the strategies and solutions our team implemented to meet the client's needs and overcome the obstacles encountered during dataset collection.
The client had created a process to collect speech data while working with a small group of people, following the steps outlined below:
The client's initial challenge was that it lacked an efficient process to find, onboard, and manage such a large crowd. In addition, it wanted the entire dataset collected in around 25 days, in a proper format and with metadata.
Through a comprehensive analysis of the current process, it has been observed that:
To meet this requirement, we used our mobile application, Yugo, which made recording easy for participants and relieved them of redundant tasks. The application lets participants focus on producing quality recordings while the error-prone tasks are automated.
Yugo is our state-of-the-art mobile application for collecting speech data.
Here's how it works:
To accomplish our client's goal, we onboarded native German and Spanish participants from our global crowd community for this project. We aimed for an even distribution of participants across age groups and genders, as shown below:
Gender    18-35    36-50    51+
Male      20%      20%      10%
Female    20%      20%      10%
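To make the split concrete, the percentages can be turned into head counts. The sketch below is only an illustration: the function name `quotas`, the gender row labels, and the use of 1,000 as the total (the participant figure quoted earlier) are assumptions, not a description of how the project was actually planned.

```python
# Illustrative sketch: convert percentage shares into participant counts.
# The row labels and the helper name are assumptions made for this example.
AGE_GROUPS = ["18-35", "36-50", "51+"]
SHARES = {  # percent of all participants, per gender and age group
    "male": [20, 20, 10],
    "female": [20, 20, 10],
}


def quotas(total, shares=SHARES):
    """Return {(gender, age_group): count} for a given number of participants."""
    if sum(sum(row) for row in shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    # Integer division: totals that are not multiples of 10 lose the remainder.
    return {
        (gender, age): total * pct // 100
        for gender, row in shares.items()
        for age, pct in zip(AGE_GROUPS, row)
    }


if __name__ == "__main__":
    for cell, count in quotas(1000).items():
        print(cell, count)
```

With a total of 1,000, each 20% cell corresponds to 200 participants and each 10% cell to 100.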
In addition, all recordings were collected with consistent technical specifications: a sample rate of 16 kHz and a bit depth of 16 bits. All audio files are provided in WAV format, which stores the audio losslessly.
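A consistency check of this kind can be sketched with Python's standard `wave` module. This is a minimal illustration, not the QA tooling used in the project; the function name `check_wav` and the constant names are assumptions.

```python
import wave

EXPECTED_RATE_HZ = 16_000       # 16 kHz sample rate
EXPECTED_SAMPLE_WIDTH = 2       # 16 bits = 2 bytes per sample


def check_wav(path):
    """Return a list of header problems for one WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"bit depth is {wav.getsampwidth() * 8} bits")
    return problems
```

Calling `check_wav` on each delivered file and collecting the non-empty results would flag any recording that deviates from the stated specification.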
Delivered Assets
Time Duration
Participants
Languages
We completed the entire collection within 25 days, including the Quality Assurance (QA) process. The result is a varied and unbiased dataset for Automatic Speech Recognition (ASR) training that satisfies the client's requirements. Using the Yugo mobile application reduced the recording time for a single sentence from 1-1.5 minutes to 20-25 seconds, making the workflow considerably faster.
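The size of that reduction can be worked out directly from the quoted ranges. The short calculation below uses only the per-sentence figures stated above; the variable names are chosen for this example.

```python
# Per-sentence recording time in seconds, taken from the figures above.
before = (60, 90)   # 1-1.5 minutes
after = (20, 25)    # 20-25 seconds

# Most conservative case: fastest "before" against slowest "after".
min_speedup = before[0] / after[1]
# Most optimistic case: slowest "before" against fastest "after".
max_speedup = before[1] / after[0]

print(f"{min_speedup:.1f}x to {max_speedup:.1f}x faster")
# prints "2.4x to 4.5x faster"
```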
By automating repetitive tasks, our team streamlined the process and made it accessible to all users, which reduced overall errors and rejections. This kept the task fair and engaging for recorders while the project stayed within the specified budget.
Our Yugo administration system provides the client with the entire dataset, including all metadata, in a structured format. The final delivery includes the audio file link, the corresponding text file link, the recorder's age, gender, country, state, and zip code, recording device information, the dates of recording and review, and the Quality Assurance (QA) status.
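One way to picture such a structured record is a typed data class serialized to JSON, one object per recording. The sketch below is hypothetical: the class name, field names, and JSON layout are assumptions based on the fields listed above, not the actual Yugo delivery schema, and the sample values are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RecordingMetadata:
    """Hypothetical per-recording metadata record (field names are assumed)."""
    audio_file_link: str
    text_file_link: str
    recorder_age: int
    recorder_gender: str
    country: str
    state: str
    zip_code: str          # kept as text so leading zeros survive
    recording_device: str
    recording_date: str    # ISO 8601 date, e.g. "2023-01-15"
    review_date: str
    qa_status: str


def to_json_line(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = RecordingMetadata(
        audio_file_link="https://example.com/audio/0001.wav",
        text_file_link="https://example.com/text/0001.txt",
        recorder_age=29,
        recorder_gender="female",
        country="Germany",
        state="Bayern",
        zip_code="80331",
        recording_device="example-phone",
        recording_date="2023-01-15",
        review_date="2023-01-16",
        qa_status="approved",
    )
    print(to_json_line(sample))
```

Keeping the zip code as a string and the dates in ISO 8601 form avoids the most common parsing surprises when the metadata is loaded downstream.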
FutureBeeAI helped us collect our dream dataset in the shortest time possible. Yugo made the entire process very easy.