Introduction
The Gujarati Scripted Monologue Speech Dataset for the Healthcare Domain is a voice dataset built to accelerate the development and deployment of Gujarati-language automatic speech recognition (ASR) systems, with a focus on real-world healthcare interactions.
Speech Data
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Gujarati, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
•Participant Diversity
•Speakers: 60 native Gujarati speakers.
•Regional Balance: Participants are sourced from multiple regions across Gujarat, reflecting diverse dialects and linguistic traits.
•Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
•Recording Specifications
•Nature of Recordings: Scripted monologues based on healthcare-related use cases.
•Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
•Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
•Environment: Clean, echo-free spaces ensure clear, noise-free audio capture.
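The recording specifications above can be checked on delivery. The following is a minimal sketch using Python's standard `wave` module; the function name `check_clip` and the constant names are illustrative assumptions, not part of the dataset or its tooling, and the thresholds are simply the values stated in the list above.

```python
import wave

# Values taken from the recording specifications listed above.
EXPECTED_CHANNELS = 1                  # mono
EXPECTED_SAMPLE_WIDTH_BYTES = 2        # 16-bit depth
EXPECTED_SAMPLE_RATES = {8000, 16000}  # 8 kHz and 16 kHz
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0   # stated clip duration range


def check_clip(path):
    """Return a list of problems found in one WAV file (empty list = matches the spec).

    `check_clip` is a hypothetical helper written for this example.
    """
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if sample_rate not in EXPECTED_SAMPLE_RATES:
        problems.append(f"unexpected sample rate: {sample_rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f} s is outside 5 to 30 s")
    return problems
```

Calling `check_clip("clip.wav")` on each delivered file (the file name here is a placeholder) and logging any non-empty result is one simple way to catch format mismatches before training.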
Topic Coverage
The prompts span a broad range of healthcare-specific interactions, such as:
•Patient check-in and follow-up communication
•Appointment booking and cancellation dialogues
•Insurance and regulatory support queries
•Medication, test results, and consultation discussions
•General health tips and wellness advice
•Emergency and urgent care communication
•Technical support for patient portals and apps
•Domain-specific scripted statements and FAQs
Contextual Depth
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
•Names: Gender- and region-appropriate Gujarati names
•Addresses: Varied local address formats spoken naturally
•Dates & Times: References to appointment dates, times, follow-ups, and schedules
•Medical Terminology: Common medical procedures, symptoms, and treatment references
•Numbers & Measurements: Health data like dosages, vitals, and test result values
•Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Transcription
Every audio recording is accompanied by a verbatim, manually verified transcription.
•Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
•Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.
•Quality Control: Transcriptions are created and reviewed by native Gujarati transcribers to ensure precision and consistency.
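Because each recording ships with a plain-text transcription, a loader can pair the two by file name. The sketch below assumes, purely for illustration, that an audio file and its transcription share the same base name (for example `clip_0001.wav` and `clip_0001.txt`) and that transcriptions are UTF-8 encoded; the actual naming convention and encoding are not stated here and should be confirmed against the delivered files. The function names are hypothetical.

```python
from pathlib import Path


def _index_by_stem(folder, suffix):
    """Map base file name -> path for files in `folder` with the given suffix.

    The suffix comparison is case-insensitive, so both .txt and .TXT match.
    """
    return {
        p.stem: p
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    }


def pair_clips_with_transcripts(audio_dir, transcript_dir):
    """Return (pairs, unmatched).

    pairs: list of (audio_path, transcript_text), sorted by base name.
    unmatched: sorted base names that have audio or a transcript, but not both.
    Assumes matching base names and UTF-8 text (both are assumptions).
    """
    audio = _index_by_stem(audio_dir, ".wav")
    texts = _index_by_stem(transcript_dir, ".txt")

    pairs = []
    for stem in sorted(audio.keys() & texts.keys()):
        text = texts[stem].read_text(encoding="utf-8").strip()
        pairs.append((audio[stem], text))

    unmatched = sorted(audio.keys() ^ texts.keys())
    return pairs, unmatched
```

Returning the unmatched names alongside the pairs makes missing or misnamed files visible instead of silently dropping them.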
Metadata
Comprehensive metadata is included for each audio clip and participant, providing full traceability and analysis capabilities.
•Participant Metadata: Unique speaker ID, age, gender, country, region/state, and dialect
•Recording Metadata: Text transcript, recording environment details, device specifications, audio format, sample rate, and bit depth
This level of detail allows developers to fine-tune models for regional accents, demographics, and acoustic variations.
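As an illustration of using the participant metadata to build a targeted subset, the sketch below filters clip records by age range and by exact-match attributes. It assumes the metadata has already been loaded into a list of dictionaries; the key names (`speaker_id`, `age`, `gender`, `dialect`) and the sample values are hypothetical placeholders, since the actual metadata file layout is not described here.

```python
def select_clips(records, min_age=None, max_age=None, **equals):
    """Return the records that satisfy every given filter.

    records: iterable of dicts, one per clip (field names are assumptions).
    min_age / max_age: optional inclusive age bounds.
    equals: exact-match filters, e.g. gender="female", dialect="dialect_a".
    """
    selected = []
    for record in records:
        if min_age is not None and record["age"] < min_age:
            continue
        if max_age is not None and record["age"] > max_age:
            continue
        if all(record.get(key) == value for key, value in equals.items()):
            selected.append(record)
    return selected


# Made-up sample records, for demonstration only.
sample = [
    {"speaker_id": "S01", "age": 24, "gender": "female", "dialect": "dialect_a"},
    {"speaker_id": "S02", "age": 61, "gender": "male", "dialect": "dialect_a"},
    {"speaker_id": "S03", "age": 35, "gender": "female", "dialect": "dialect_b"},
]

subset = select_clips(sample, max_age=40, gender="female")
print([r["speaker_id"] for r in subset])  # → ['S01', 'S03']
```

The same pattern extends to recording metadata such as sample rate or device type, provided those fields are present in the loaded records.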
Applications & Use Cases
This dataset supports a wide array of healthcare-related AI and speech technology use cases:
•ASR Model Training: Improve model accuracy for medical voice input and queries.
•Voice Synthesis & TTS: Train synthetic voice models for interactive health applications.
•Voice Assistants: Build intelligent healthcare bots that speak Gujarati.
•Medical Chatbots: Enable more accurate patient communication through chatbot systems.
•Entity Recognition (NER): Teach models to detect key medical data like drug names, appointment dates, and symptoms.
•NLP & Language Understanding: Enhance downstream tasks like sentiment analysis and medical intent classification.
Secure & Ethical Collection
The dataset was created using FutureBeeAI’s proprietary platform, “Yugo,” ensuring full compliance and security throughout the process.
•Data collection was fully consented, anonymized, and conducted ethically.
•No personally identifiable information (PII) is captured in any part of the dataset.
•The dataset remains securely stored and processed on our platform, in accordance with international data protection standards.
License
This Gujarati Scripted Monologue Speech Dataset for the Healthcare Domain is available for commercial licensing, enabling you to confidently develop and deploy speech AI solutions in the medical and healthcare sectors.