Bulgarian Scripted Monologue Speech Dataset for General Domain

The audio dataset comprises scripted monologue speech data in the General domain, featuring native Bulgarian speakers. It includes speech data, detailed metadata, and accurate transcriptions.

About this Off-the-shelf Speech Dataset

Introduction

The Bulgarian Scripted Monologue Speech Dataset for the General Domain is a carefully curated resource designed to support the development of Bulgarian language speech recognition systems. This dataset focuses on general-purpose conversational topics and is ideal for a wide range of AI applications requiring natural, domain-agnostic Bulgarian speech data.

Speech Data

This dataset features over 6,000 high-quality scripted monologue recordings in Bulgarian. The prompts span diverse real-life topics commonly encountered in general conversations and are intended to help train robust and accurate speech-enabled technologies.

•Participant Diversity

  •Speakers: 60 native Bulgarian speakers

  •Regions: Broad regional coverage ensures diverse accents and dialects

  •Demographics: Participants aged 18 to 70, with a 60:40 male-to-female ratio

•Recording Specifications

  •Recording Type: Scripted monologues and prompt-based recordings

  •Audio Duration: 5 to 30 seconds per file

  •Format: WAV, mono channel, 16-bit, 8 kHz & 16 kHz sample rates

  •Environment: Clean, noise-free conditions to ensure clarity and usability
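The recording specifications above can be checked on a delivered file before training. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the constant names are illustrative assumptions, not part of any FutureBeeAI tooling, and the thresholds are simply the values listed above.

```python
import wave

# Values taken from the "Recording Specifications" list above.
EXPECTED_CHANNELS = 1            # mono
EXPECTED_SAMPLE_WIDTH = 2        # 16-bit audio = 2 bytes per sample
EXPECTED_RATES = {8000, 16000}   # 8 kHz and 16 kHz
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0


def check_wav(path):
    """Return a list of problems found; an empty list means the file matches."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate  # length in seconds

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate not in EXPECTED_RATES:
        problems.append(f"unexpected sample rate: {rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f} s is outside 5 to 30 s")
    return problems
```

Calling `check_wav("some_recording.wav")` (a placeholder file name) on each file and logging any non-empty result is one simple way to catch files that deviate from the stated format.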

Topic Coverage

The dataset covers a wide variety of general conversation scenarios, including:

•Daily Conversations

•Topic-Specific Discussions

•General Knowledge and Advice

•Idioms and Sayings

Contextual Features

To enhance authenticity, the prompts include:

•Names: Male and female names specific to different regions of Bulgaria

•Addresses: Commonly used address formats in daily Bulgarian speech

•Dates & Times: References used in general scheduling and time expressions

•Organization Names: Names of businesses, institutions, and other entities

•Numbers & Currencies: Mentions of quantities, prices, and monetary values

Each prompt is designed to reflect everyday use cases, making it suitable for developing generalized NLP and ASR solutions.

Transcription

Every audio file in the dataset is accompanied by a verbatim text transcription, ensuring accurate training and evaluation of speech models.

•Content: Exact match to the spoken audio

•Format: Plain text (.TXT), named identically to the corresponding audio file

•Quality Control: All transcripts are validated by native Bulgarian transcribers
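Because each transcript shares its file name with the corresponding audio file, the two can be paired by name. The sketch below assumes lowercase `.wav` and `.txt` extensions, transcripts stored next to the audio, and UTF-8 encoding; none of these is confirmed by the description above, and `pair_audio_with_transcripts` is a hypothetical helper name.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Match each .wav under `root` with the same-named .txt file beside it.

    Returns (pairs, missing): `pairs` is a list of (wav_path, transcript_text)
    tuples, and `missing` lists the .wav files that have no transcript.
    """
    pairs, missing = [], []
    for wav_path in sorted(Path(root).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if txt_path.is_file():
            # UTF-8 is an assumption; Bulgarian text is Cyrillic, so the
            # actual encoding should be confirmed against the delivery.
            text = txt_path.read_text(encoding="utf-8").strip()
            pairs.append((wav_path, text))
        else:
            missing.append(wav_path)
    return pairs, missing
```

A non-empty `missing` list is a quick signal that the layout differs from these assumptions, for example uppercase `.TXT` extensions on a case-sensitive file system.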

Metadata

Rich metadata is included for detailed filtering and analysis:

•Speaker Metadata: Unique speaker ID, age, gender, region, and dialect

•Audio Metadata: Prompt transcript, recording setup, device specs, sample rate, bit depth, and format
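As an illustration of filtering on the speaker metadata, the sketch below selects speakers by gender, age range, and region. The metadata file format is not described here, so the code assumes the records have already been loaded into dictionaries; the key names (`speaker_id`, `age`, `gender`, `region`) and the sample records are hypothetical and not taken from the dataset.

```python
def filter_speakers(records, gender=None, min_age=None, max_age=None, region=None):
    """Return the records that satisfy every criterion that was supplied."""
    selected = []
    for rec in records:
        if gender is not None and rec["gender"] != gender:
            continue
        if min_age is not None and rec["age"] < min_age:
            continue
        if max_age is not None and rec["age"] > max_age:
            continue
        if region is not None and rec["region"] != region:
            continue
        selected.append(rec)
    return selected


# Made-up records for illustration only; they are not real dataset entries.
example_records = [
    {"speaker_id": "spk_001", "age": 24, "gender": "female", "region": "Sofia"},
    {"speaker_id": "spk_002", "age": 57, "gender": "male", "region": "Varna"},
    {"speaker_id": "spk_003", "age": 38, "gender": "female", "region": "Plovdiv"},
    {"speaker_id": "spk_004", "age": 65, "gender": "female", "region": "Sofia"},
]

# Select female speakers aged 40 or younger.
younger_women = filter_speakers(example_records, gender="female", max_age=40)
```

The same pattern extends to the audio metadata fields, for instance selecting only the 16 kHz recordings, once the actual field names are known.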

Applications & Use Cases

This dataset can power a variety of Bulgarian language AI technologies, including:

•Speech Recognition Training: ASR model development and fine-tuning

•Voice Synthesis: Training data for TTS and voice cloning models

•Voice Assistants: Building general-purpose Bulgarian voice assistants

•Entity Recognition: Identifying names, numbers, and key terms

•Language Understanding: Training models for tasks like sentiment analysis, topic classification, and semantic parsing

Ethical & Secure Data Collection

•All data was collected using FutureBeeAI’s proprietary Yugo platform

•Data remained secure and was never shared externally during the process

•Participant consent and ethical guidelines were strictly followed

•No personally identifiable information (PII) is included in the dataset

License

This dataset is developed and owned by FutureBeeAI and is available for commercial use, offering high-value resources for enterprises and research organizations developing Bulgarian speech technologies.

Use Cases

•Use of scripted speech monologues datasets for Automatic Speech Recognition: ASR, Conversational AI, Chatbot

•Use of scripted speech monologues datasets for TTS: TTS, Speech Analytics, Mobile Speech

Dataset Sample(s)

TRANSCRIPTION

SPEAKER	DURATION	TRANSCRIPT
Male(20)	00:00:05	La culture, c'est comme la confiture, moins on en a, plus on l'étale.
Male(42)	00:00:06	La gestion se caractérise dans ces situations par le fait de confier à autrui, ou à soi-même, des affaires à gérer.
Male(43)	00:00:04	En politique une absurdité n'est pas un obstacle
Male(29)	00:00:06	la hausse des températures affecte les ressources d’eau, tant en termes de quantité que de qualité
Male(34)	00:00:05	Chaque années des millions de tonnes de terres arables disparaissent