Bahasa Healthcare Conversational Chat Dataset

This dataset features Bahasa text-based chat conversations between customers and call center agents, specifically focused on healthcare-related interactions. It covers real-world scenarios designed to reflect the authentic language, tone, and structure of Bahasa healthcare conversations. This dataset is ideal for training chatbots, smart assistants, and NLP models tailored to the healthcare domain.

About This OTS Dataset

Introduction

The Bahasa Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Bahasa-speaking regions.

Participant & Chat Overview

•Participants: 150+ native Bahasa speakers from the FutureBeeAI Crowd Community

•Conversation Length: 300–700 words per chat

•Turns per Chat: 50–150 dialogue turns across both participants

•Chat Types: Inbound and outbound

•Sentiment Coverage: Positive, neutral, and negative outcomes included
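
The stated ranges can double as a quick sanity check on delivered files. The sketch below assumes a chat has already been parsed into a list of message strings, one per turn. The function name `chat_stats` and the whitespace-based word count are illustrative assumptions, not part of any FutureBeeAI tooling.

```python
# Ranges taken from the overview above.
WORD_RANGE = (300, 700)
TURN_RANGE = (50, 150)

def chat_stats(messages):
    """Return turn and word counts and whether each falls in the stated range."""
    turns = len(messages)
    # Assumption: words are approximated by splitting on whitespace.
    words = sum(len(m.split()) for m in messages)
    return {
        "turns": turns,
        "words": words,
        "turns_in_range": TURN_RANGE[0] <= turns <= TURN_RANGE[1],
        "words_in_range": WORD_RANGE[0] <= words <= WORD_RANGE[1],
    }

# Toy input (not taken from the dataset): 60 turns of 6 words each.
sample = ["Selamat pagi, ada yang bisa dibantu?"] * 60
print(chat_stats(sample))
```

Chats that fall outside either range are not necessarily defective, but they are worth a manual look before training.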

Topic Diversity

The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

•Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups

•Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

Language Diversity & Realism

This dataset reflects the natural flow of Bahasa healthcare communication and includes:

•Authentic Naming Patterns: Bahasa personal names, clinic names, and brands

•Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Bahasa formats

•Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Bahasa-speaking regions

•Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

Conversational Flow & Structure

Conversations range from simple inquiries to complex advisory sessions, including:

•General inquiries

•Detailed problem-solving

•Routine status updates

•Treatment recommendations

•Support and feedback interactions

Each conversation typically includes these structural components:

•Greetings and verification

•Information gathering

•Problem definition

•Solution delivery

•Closing messages

•Follow-up and feedback (where applicable)

This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

Data Format & Structure

The dataset is available in JSON, CSV, and TXT formats. Each conversation provides:

•Full message history with clear speaker labels

•Participant identifiers

•Metadata (e.g., topic tags, region, sentiment)

•Compatibility with common NLP and ML pipelines
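
The exact schema is not specified here, so the following is only a minimal sketch under assumed field names (`conversation_id`, `metadata`, `messages`, `speaker`, `text`). Check the delivered files for the actual keys. It shows how a JSON record with speaker-labeled messages and metadata might be read and regrouped by speaker.

```python
import json
from collections import defaultdict

# Hypothetical record: every key name and value below is an assumption
# made for illustration, not the dataset's documented schema.
raw = """
{
  "conversation_id": "chat_0001",
  "metadata": {"topic": "appointment_scheduling", "sentiment": "positive"},
  "messages": [
    {"speaker": "customer", "text": "Halo, saya ingin membuat janji dengan dokter."},
    {"speaker": "agent", "text": "Baik, untuk hari apa?"},
    {"speaker": "customer", "text": "Hari Senin pagi."}
  ]
}
"""

def group_by_speaker(record):
    """Map each speaker label to that speaker's messages, in original order."""
    grouped = defaultdict(list)
    for msg in record["messages"]:
        grouped[msg["speaker"]].append(msg["text"])
    return dict(grouped)

record = json.loads(raw)
by_speaker = group_by_speaker(record)
print(record["metadata"]["topic"], {k: len(v) for k, v in by_speaker.items()})
```

The same grouping applies to the CSV and TXT variants once their rows or lines have been parsed into comparable dictionaries.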

Applications

This dataset supports a wide range of AI and NLP use cases in the healthcare sector:

•Healthcare Chatbots & Voice Assistants

•Appointment Scheduling Automation

•Sentiment and Emotion Detection

•NER and Medical Entity Extraction

•Text Classification & Intent Detection

•Predictive Response Models

•Bahasa NLP Research in the Healthcare Domain
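
For use cases such as predictive response models, speaker-labeled turns are commonly converted into (context, response) training pairs. The sketch below is one possible way to do that, assuming each turn is a `(speaker, text)` tuple labeled `customer` or `agent`. These labels, the function name `build_pairs`, and the `max_context` window are illustrative choices, not part of the dataset.

```python
def build_pairs(turns, target_speaker="agent", max_context=3):
    """Return (context, response) pairs for every turn by target_speaker.

    context holds up to max_context preceding turns formatted as
    "speaker: text". A target turn with no preceding turn is skipped.
    """
    pairs = []
    for i, (speaker, text) in enumerate(turns):
        if speaker != target_speaker or i == 0:
            continue
        start = max(0, i - max_context)
        context = [f"{s}: {t}" for s, t in turns[start:i]]
        pairs.append((context, text))
    return pairs

# Toy conversation written for this example, not taken from the dataset.
turns = [
    ("customer", "Halo, apakah hasil lab saya sudah keluar?"),
    ("agent", "Boleh saya tahu nomor registrasi Anda?"),
    ("customer", "Nomornya 12345."),
    ("agent", "Terima kasih, hasilnya sudah tersedia."),
]
pairs = build_pairs(turns, max_context=2)
for context, response in pairs:
    print(context, "->", response)
```

A small context window keeps the pairs short. A larger one preserves more of the verification and information-gathering steps that precede an agent's answer.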

Ethical Collection & Data Security

•Consent-Based Contribution: All participants provided informed consent

•Privacy Compliant: No personally identifiable information is shared

•Secure Data Handling: All data was collected and stored securely within FutureBeeAI's infrastructure

•Ethical Standards: Adheres to best practices in AI ethics, healthcare data governance, and privacy protection

Dataset Expansion & Customization

The dataset is actively maintained and can be customized to meet specific needs:

•Custom Annotations: Add NER tags, intent labels, sentiment scores, or medical category tags

•Topic Expansion: Collect new chats for specific health areas (e.g., pediatrics, dermatology, telemedicine)

•Region-Specific Data: Custom collection for different Bahasa-speaking countries and dialects

•Multilingual Options: Extend to additional languages or cross-lingual training needs

Licensing

This dataset is developed and owned by FutureBeeAI and is available under a commercial license. Flexible licensing terms are available for enterprise, startup, and academic use.