Dutch General Conversation Speech Dataset

This dataset captures real-world, unscripted conversations between native Dutch speakers. It includes detailed metadata and high-quality manual transcriptions, making it ideal for building accurate, human-like speech recognition and conversational AI systems.

About this Off-the-shelf Speech Dataset

Introduction

Welcome to the Dutch General Conversation Speech Dataset, a linguistically diverse corpus built to accelerate the development of Dutch speech technologies. It is designed for training and fine-tuning ASR systems, spoken language understanding models, and generative voice AI for real-world Dutch communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide range of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Dutch speech models that handle authentic Dutch accents and dialects.

Speech Data

The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Dutch. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

•Participant Diversity:

•Speakers: 60 verified native Dutch speakers from FutureBeeAI’s contributor community.

•Regions: Representing various provinces of the Netherlands to ensure dialectal diversity and demographic balance.

•Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.

•Recording Details:

•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.

•Duration: Each conversation ranges from 15 to 60 minutes.

•Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.

•Environment: Quiet, echo-free settings with no background noise.
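The recording details above can be checked programmatically once files are in hand. The following is a minimal sketch using only the Python standard library: it reads a WAV header, compares it with the stated format (stereo, 16-bit, 16 kHz), and writes each channel to its own mono file. The function names are illustrative, and which channel holds which speaker is an assumption to confirm against the delivered data.

```python
import array
import wave

# Format stated on this page: stereo, 16-bit, 16 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Read the header fields relevant to the stated format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": wav.getnframes() / rate,
        }


def matches_expected(info):
    """True when channels, bit depth, and sample rate all match EXPECTED."""
    return all(info[key] == value for key, value in EXPECTED.items())


def split_stereo(path, left_path, right_path):
    """Write each channel of a 16-bit stereo WAV to its own mono WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo file")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Stereo PCM interleaves samples as L, R, L, R, ...; a stride of 2 over
    # 16-bit items separates the channels without changing sample values.
    samples = array.array("h")
    samples.frombytes(frames)
    channels = ((left_path, samples[0::2]), (right_path, samples[1::2]))
    for out_path, channel in channels:
        with wave.open(out_path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)
            out.writeframes(channel.tobytes())
```

Calling `matches_expected(describe_wav("conversation.wav"))` on a hypothetical file name flags any recording that deviates from the stated format before it enters a training pipeline.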

Topic Diversity

The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

•Sample Topics Include:

•Family & Relationships

•Food & Recipes

•Education & Career

•Healthcare Discussions

•Social Issues

•Technology & Gadgets

•Travel & Local Culture

•Shopping & Marketplace Experiences, and many more.

Transcription

Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

•Transcription Highlights:

•Speaker-segmented dialogues

•Time-coded utterances

•Non-speech elements (pauses, laughter, etc.)

•High transcription accuracy, achieved through a double QA pass (average WER below 5%)
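For readers unfamiliar with the WER figure quoted above, here is a small sketch of the standard word error rate: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This page does not describe how its WER was measured, so the whitespace tokenization and the Dutch example sentences below are assumptions for illustration only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words gives a WER of 1/6.
example = word_error_rate("ik ga morgen naar de markt",
                          "ik ga morgen naar markt")
```

In practice, both strings are usually normalized (case, punctuation, non-speech tags) before scoring, and the chosen normalization affects the result.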

These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
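As an illustration of how speaker-segmented, time-coded JSON might feed a pipeline, the sketch below loads a transcript and computes talk time per speaker. The exact JSON schema is not given on this page, so the field names (`speaker`, `start`, `end`, `text`), the bracketed non-speech tag format, and the sample utterances are all assumptions, not the actual delivery format.

```python
import json
import re
from collections import defaultdict

# Hypothetical transcript layout; the real schema may differ.
SAMPLE_TRANSCRIPT = """
[
  {"speaker": "spk_1", "start": 0.0, "end": 2.5, "text": "Hoi, hoe gaat het?"},
  {"speaker": "spk_2", "start": 2.75, "end": 4.25, "text": "Goed, dank je. [laughter]"},
  {"speaker": "spk_1", "start": 4.5, "end": 6.0, "text": "Mooi zo."}
]
"""


def talk_time_by_speaker(segments):
    """Sum utterance durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


def strip_non_speech_tags(text):
    """Remove bracketed tags such as [laughter] and tidy the whitespace."""
    return " ".join(re.sub(r"\[[^\]]*\]", " ", text).split())


segments = json.loads(SAMPLE_TRANSCRIPT)
talk_time = talk_time_by_speaker(segments)
clean_texts = [strip_non_speech_tags(s["text"]) for s in segments]
```

The same loop can be adapted to cut audio into utterance-level clips using the start and end times, once the real field names are known.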

Metadata

The dataset comes with granular metadata for both speakers and recordings:

•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.

•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
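A short sketch of the kind of filtering this metadata makes possible: selecting recordings by topic or by a speaker attribute and totalling their duration. The record layout, key names, and every value below are invented for illustration; the real metadata files may be organized differently.

```python
# Invented records for illustration only; key names are assumptions
# loosely based on the metadata fields listed on this page.
SPEAKERS = {
    "p01": {"age": 34, "gender": "F", "province": "Limburg"},
    "p02": {"age": 52, "gender": "M", "province": "Groningen"},
    "p03": {"age": 27, "gender": "M", "province": "Friesland"},
}

RECORDINGS = [
    {"recording_id": "rec_001", "topic": "Food & Recipes",
     "duration_min": 20, "speaker_ids": ["p01", "p02"]},
    {"recording_id": "rec_002", "topic": "Travel & Local Culture",
     "duration_min": 45, "speaker_ids": ["p02", "p03"]},
    {"recording_id": "rec_003", "topic": "Food & Recipes",
     "duration_min": 30, "speaker_ids": ["p01", "p03"]},
]


def select_recordings(recordings, speakers, speaker_filter=None, topic=None):
    """Keep recordings matching the topic (if given) in which at least one
    speaker satisfies speaker_filter (if given)."""
    selected = []
    for rec in recordings:
        if topic is not None and rec["topic"] != topic:
            continue
        if speaker_filter is not None and not any(
            speaker_filter(speakers[sid]) for sid in rec["speaker_ids"]
        ):
            continue
        selected.append(rec)
    return selected


def total_minutes(recordings):
    """Total duration of the given recordings in minutes."""
    return sum(rec["duration_min"] for rec in recordings)


limburg = select_recordings(
    RECORDINGS, SPEAKERS, speaker_filter=lambda s: s["province"] == "Limburg"
)
```

Passing a predicate rather than fixed arguments keeps the helper usable for any speaker field (age band, gender, accent) without changing its signature.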

Usage and Applications

This dataset is a versatile resource for multiple Dutch speech and language AI applications:

•ASR Development: Train accurate speech-to-text systems for Dutch.

•Voice Assistants: Build smart assistants capable of understanding natural Dutch conversations.

•Conversational AI: Develop chatbots and voicebots for multilingual or dialectal Dutch audiences.

•Speech Analytics: Extract patterns, detect topics, and evaluate speaker behavior.

•Generative Voice AI: Enable real-life dialogue synthesis or summarization with native-sounding output.

Secure and Ethical Collection

•All data was collected using “Yugo,” FutureBeeAI’s proprietary collection and transcription platform.

•Data remained within a secure environment throughout the process.

•Collected in compliance with strict privacy, consent, and ethical guidelines.

•No personally identifiable information is included in any recording or transcript.

•Free of copyrighted content. Safe for commercial and research use.

Customization and Updates

We continuously enrich this dataset with new, naturally captured conversations. Additionally, we support project-specific data customization:

•Available Customization:

•Acoustic Conditions: In-car, restaurant, outdoor, or noisy environments on request.

•Sampling Rate: Custom WAV files at 8 kHz to 48 kHz.

•Transcription Guidelines: Tailored formatting, annotation levels, or QA standards.

License

This Dutch General Conversation Dataset was created by FutureBeeAI and is available for commercial licensing.

Use Cases

Use of speech data for Automatic Speech Recognition

ASR

Conversational AI

Chatbot

Language Modelling

TTS

Speech Analytics

Dataset Sample(s)

Samples will be available soon!

Dataset Details

Language: Dutch

Language code:

Country: Netherlands

Accents: Groningen, Limburg, Noord Brabant (Brabants), Noord Holland, Overijssel, ABN, Friesland, Gelderland

Gender Distribution: M:55, F:45

Age Group: 18-70 Years

File Details

Environment: Silent, Noisy

Bit Depth: 16 bit

Format: WAV

Sample rate: 16 kHz

Channel: Stereo (dual-channel, separated speakers)

Audio file duration: 15-60 minutes
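The file details above are enough for a rough storage and file-count estimate. The sketch below works through the arithmetic for uncompressed PCM at the stated format and the 30-hour total. It ignores WAV header overhead and assumes every file falls within the stated 15 to 60 minute range, so the figures are estimates rather than published numbers.

```python
SAMPLE_RATE_HZ = 16_000   # stated sample rate
BYTES_PER_SAMPLE = 2      # 16-bit depth
CHANNELS = 2              # stereo
TOTAL_HOURS = 30          # stated dataset size


def pcm_bytes(seconds):
    """Uncompressed PCM payload size in bytes for a given duration."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


total_bytes = pcm_bytes(TOTAL_HOURS * 3600)   # 6_912_000_000 (about 6.9 GB)
shortest_file = pcm_bytes(15 * 60)            # 57_600_000 (about 58 MB)
longest_file = pcm_bytes(60 * 60)             # 230_400_000 (about 230 MB)

# Bounds on the number of conversations implied by the duration range.
total_minutes = TOTAL_HOURS * 60
min_conversations = total_minutes // 60       # 30, if every file ran 60 minutes
max_conversations = total_minutes // 15       # 120, if every file ran 15 minutes
```

Custom deliveries at other sampling rates scale linearly: the same audio at 48 kHz would take three times the space.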


Similar to General Conversation Speech Datasets

Marathi (India)

Marathi General Conversation Speech Data

Spontaneous two-speaker general conversations in Marathi

60 Speech Hours

80 People

ASR

Conversational AI

Korean (South Korea)

Korean General Conversation Speech Data

Spontaneous two-speaker general conversations in Korean

50 Speech Hours

70 People

ASR

Conversational AI

Bengali (Bangladesh)

Bengali (Bangladesh) General Conversation Speech Data

Spontaneous two-speaker general conversations in Bengali (Bangladesh)

50 Speech Hours

70 People

ASR

Conversational AI

Norwegian (Norway)

Norwegian General Conversation Speech Data

Spontaneous two-speaker general conversations in Norwegian

50 Speech Hours

70 People

ASR

Conversational AI


Dutch Retail & E-com CC Speech Data

Retail & E-commerce call center audio data in Dutch.

30 Speech Hours

60 People

Call Center Conversational AI

ASR

Dutch (Netherlands)

Dutch Real Estate Scripted Monologue Speech Data

Audio recordings of scripted prompts in Dutch language for Real Estate domain.

6000+ prompts

60+ people

ASR

Conversational AI

Dutch (Netherlands)

Dutch Telecom CC Speech Data

Telecom call center audio data in Dutch.

30 Speech Hours

60 People

Call Center Conversational AI

ASR

Dutch (Netherlands)

Dutch Healthcare CC Speech Data

Healthcare call center audio data in Dutch.

30 Speech Hours

60 People

Call Center Conversational AI

ASR
