Urdu Scripted Monologue Speech Dataset for the Healthcare Domain

The audio dataset comprises scripted monologue speech data in the Healthcare domain, featuring native Urdu speakers from Pakistan. It includes speech data, detailed metadata, and accurate transcriptions.

About this Off-the-shelf Speech Dataset

Introduction

Introducing the Urdu Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Urdu language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.

Speech Data

This dataset includes over 6,000 high-quality scripted audio prompts recorded in Urdu, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.

•Participant Diversity

  •Speakers: 60 native Urdu speakers.

  •Regional Balance: Participants are sourced from multiple regions across Pakistan, reflecting diverse dialects and linguistic traits.

  •Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.

•Recording Specifications

  •Nature of Recordings: Scripted monologues based on healthcare-related use cases.

  •Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.

  •Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.

  •Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
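
The recording specifications above can be checked programmatically when the files arrive. The following is a minimal sketch using Python's standard `wave` module; the function name `check_clip` and the way problems are reported are illustrative assumptions, not part of the dataset's tooling. The thresholds come directly from the specifications listed above.

```python
import wave

# Values taken from the recording specifications listed above.
EXPECTED_CHANNELS = 1            # mono
EXPECTED_SAMPLE_WIDTH = 2        # 16-bit depth = 2 bytes per sample
EXPECTED_RATES = {8000, 16000}   # 8 kHz and 16 kHz
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0


def check_clip(path):
    """Return a list of problems found in one WAV file (empty if it matches).

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate not in EXPECTED_RATES:
        problems.append(f"unexpected sample rate: {rate} Hz")

    # Duration in seconds is the frame count divided by the sample rate.
    duration = frames / rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f} s is outside 5 to 30 s")
    return problems
```

Calling `check_clip` on each file and logging any non-empty result gives a quick sanity pass before training; clips that fail can be inspected by hand rather than silently dropped.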

Topic Coverage

The prompts span a broad range of healthcare-specific interactions, such as:

•Patient check-in and follow-up communication

•Appointment booking and cancellation dialogues

•Insurance and regulatory support queries

•Medication, test results, and consultation discussions

•General health tips and wellness advice

•Emergency and urgent care communication

•Technical support for patient portals and apps

•Domain-specific scripted statements and FAQs

Contextual Depth

To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:

•Names: Gender- and region-appropriate Pakistani names

•Addresses: Varied local address formats spoken naturally

•Dates & Times: References to appointment dates, times, follow-ups, and schedules

•Medical Terminology: Common medical procedures, symptoms, and treatment references

•Numbers & Measurements: Health data like dosages, vitals, and test result values

•Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.

Transcription

Every audio recording is accompanied by a verbatim, manually verified transcription.

•Content: The transcription mirrors the exact scripted prompt recorded by the speaker.

•Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.

•Quality Control: Transcriptions are created and reviewed by native Urdu transcribers to ensure precision and consistency.
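
Pairing each audio clip with its transcription is usually the first integration step. The sketch below assumes that an audio file and its transcript share the same base file name (for example `clip.wav` and `clip.txt`) and that transcripts are UTF-8 encoded. Neither detail is stated in this description, so treat both as assumptions to confirm against the delivered files. The function name `pair_clips` is hypothetical.

```python
from pathlib import Path


def pair_clips(audio_dir, transcript_dir):
    """Match WAV files to TXT transcripts that share the same file stem.

    Assumed convention: "name.wav" pairs with "name.txt". Extension
    matching is case-insensitive so ".TXT" and ".WAV" are also found.
    Returns (pairs, missing): pairs is a list of (wav_path, text),
    missing lists audio file names with no matching transcript.
    """
    transcripts = {
        p.stem: p
        for p in Path(transcript_dir).iterdir()
        if p.suffix.lower() == ".txt"
    }
    wav_files = sorted(
        p for p in Path(audio_dir).iterdir() if p.suffix.lower() == ".wav"
    )

    pairs, missing = [], []
    for wav_path in wav_files:
        txt_path = transcripts.get(wav_path.stem)
        if txt_path is None:
            missing.append(wav_path.name)
            continue
        # UTF-8 is an assumption; Urdu text requires a Unicode encoding.
        text = txt_path.read_text(encoding="utf-8").strip()
        pairs.append((wav_path, text))
    return pairs, missing
```

Returning the unmatched audio files separately, instead of raising an error, makes it easy to report gaps in a delivery without stopping the whole import.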

Metadata

Comprehensive metadata is included for each audio clip and participant, providing full traceability and analysis capabilities.

•Participant Metadata: Unique speaker ID, age, gender, country, region/state, and dialect

•Recording Metadata: Text transcript, recording environment details, device specifications, audio format, sample rate, and bit depth

This level of detail allows developers to fine-tune models for regional accents, demographics, and acoustic variations.
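
As an illustration of how participant metadata might be used to build targeted subsets, here is a minimal sketch. It assumes the metadata can be read as a table with one row per speaker; the delivery format is not specified in this description. The column names (`speaker_id`, `age`, `gender`, `region`) and the sample rows are hypothetical placeholders, not values from the dataset.

```python
import csv
import io
from collections import Counter


def count_by(rows, field):
    """Count rows per distinct value of one metadata field."""
    return Counter(row[field] for row in rows)


def select(rows, **criteria):
    """Keep only the rows whose fields equal every given criterion."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE = """speaker_id,age,gender,region
spk_001,34,female,Punjab
spk_002,52,male,Sindh
spk_003,27,male,Punjab
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(count_by(rows, "region"))                     # speakers per region
print(select(rows, gender="male", region="Punjab")) # one targeted subset
```

Counting by field first shows how balanced the speakers are across a given attribute, which helps decide whether a regional or demographic subset is large enough to fine-tune on.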

Applications & Use Cases

This dataset supports a wide array of healthcare-related AI and speech technology use cases:

•ASR Model Training: Improve model accuracy for medical voice input and queries.

•Voice Synthesis & TTS: Train synthetic voice models for interactive health applications.

•Voice Assistants: Build intelligent healthcare bots that speak Urdu.

•Medical Chatbots: Enable more accurate patient communication through chatbot systems.

•Named Entity Recognition (NER): Teach models to detect key medical data like drug names, appointment dates, and symptoms.

•NLP & Language Understanding: Enhance downstream tasks like sentiment analysis and medical intent classification.

Secure & Ethical Collection

The dataset was created using FutureBeeAI’s proprietary platform, “Yugo,” ensuring full compliance and security throughout the process.

•Data collection was fully consented, anonymized, and conducted ethically.

•No personally identifiable information (PII) is captured in any part of the dataset.

•The dataset remains securely stored and processed on our platform, in accordance with international data protection standards.

License

This Urdu Scripted Monologue Speech Dataset for the Healthcare Domain is available for commercial licensing, enabling you to confidently develop and deploy speech AI solutions in the medical and healthcare sectors.