Introduction
The Bahasa Wake Word & Voice Command Dataset is curated to support the training and development of voice-activated systems. It contains a large collection of wake words and command phrases for building voice assistants and other speech-enabled technologies. The dataset is designed to support accurate wake word detection and voice command recognition, which in turn improves overall system performance and user experience.
Speech Data
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
•Wake words followed by command phrases
Participant Diversity
•Speakers: 50 native Bahasa speakers from the FutureBeeAI community
•Regions: Participants from various Indonesian provinces, ensuring broad coverage of accents and dialects
•Demographics: Ages 18–70; 60% male and 40% female participants
Recording Details
•Type: Scripted wake words and command phrases
•Duration: 1 to 15 seconds per clip
•Format: WAV, stereo, 16-bit, with sample rates ranging from 16 kHz to 48 kHz
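The recording details above can be checked against a delivered clip before training. The following is a minimal sketch using Python's standard `wave` module, not part of any FutureBeeAI tooling; `check_clip` and the constant names are hypothetical, and the thresholds are simply the values stated in the list above.

```python
import wave

# Thresholds taken from the Recording Details list; names are illustrative.
EXPECTED_CHANNELS = 2            # stereo
EXPECTED_SAMPLE_WIDTH = 2        # 16-bit audio = 2 bytes per sample
MIN_RATE_HZ, MAX_RATE_HZ = 16_000, 48_000
MIN_SECONDS, MAX_SECONDS = 1.0, 15.0


def check_clip(path):
    """Return a list of problems found in a WAV file; an empty list means it matches."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected stereo, found {channels} channel(s)")
    if sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is outside 16 kHz to 48 kHz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f} s is outside 1 to 15 seconds")
    return problems
```

Calling `check_clip("clip.wav")` on a conforming file returns an empty list, so the function can be used to filter a directory of recordings before feature extraction.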
        
Dataset Diversity
•Wake Word Types
    •Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Ok Ford, etc.
    •Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, etc.
    •Home Appliance Wake Words: Hi LG, Ok LG, Hello Lloyd, and more
•Command Types
    •Automobile: Play music, check directions, voice search, provide feedback, and more
    •Voice Assistant: Ask general questions, make calls, control devices, shopping, manage calendars, and more
    •Home Appliances: Control appliances, check status, set reminders/alarms, manage shopping lists, etc.
•Recording Environments
    •Background traffic noise
    •People talking in the background
This diversity ensures robust training for real-world voice assistant applications.
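Because each recording pairs a wake word with a command phrase, it can be useful to separate the two parts of a transcript, for example to label data for a wake word detector and a command recognizer separately. The sketch below is one possible approach in Python, not part of the dataset tooling. `WAKE_WORDS` is copied from the wake word lists above (which end in "etc.", so it is incomplete), `split_transcript` is a hypothetical helper, and the code assumes transcripts are plain text that begin with the wake word.

```python
# Wake words copied from the lists above; the source lists are open-ended,
# so extend this as needed.
WAKE_WORDS = [
    "Hey Mercedes", "Hey BMW", "Hey Porsche", "Hey Volvo", "Hey Audi",
    "Hi Genesis", "Ok Ford", "Hey Siri", "Ok Google", "Alexa",
    "Hey Cortana", "Hi Bixby", "Hey Celia", "Hi LG", "Ok LG", "Hello Lloyd",
]


def split_transcript(transcript):
    """Split a 'wake word + command' transcript into (wake_word, command).

    Returns (None, transcript) when no known wake word starts the text.
    Matching is case-insensitive and requires a word boundary after the wake word.
    """
    text = transcript.strip()
    # Try longer wake words first so a longer phrase wins over a shorter prefix.
    for wake in sorted(WAKE_WORDS, key=len, reverse=True):
        if text[:len(wake)].lower() != wake.lower():
            continue
        rest = text[len(wake):]
        if rest and rest[0].isalnum():
            continue  # e.g. "Alexander ..." must not match "Alexa"
        return wake, rest.lstrip(" ,.!?:;")
    return None, text
```

For instance, `split_transcript("Hey Siri, play music")` yields the wake word `"Hey Siri"` and the command `"play music"`. A simple prefix match is used here for clarity; real transcripts may need normalization that this sketch does not attempt.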
Metadata
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
•Participant Metadata: Unique ID, age, gender, region, accent, dialect
•Recording Metadata: Transcript, environment, pace, device used, sample rate, bit depth, file format
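To illustrate how these metadata fields could drive filtering, here is a small Python sketch. The actual delivery format and field names of the metadata are not specified here, so `ClipMetadata`, its attribute names, and `select_clips` are all assumptions made for illustration; only the list of fields comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipMetadata:
    # Hypothetical field names mirroring the participant metadata list.
    participant_id: str
    age: int
    gender: str
    region: str
    accent: str
    dialect: str
    # Hypothetical field names mirroring the recording metadata list.
    transcript: str
    environment: str
    pace: str
    device: str
    sample_rate_hz: int
    bit_depth: int
    file_format: str


def select_clips(clips, **criteria):
    """Return the clips whose fields equal every keyword criterion.

    An unknown field name raises AttributeError rather than being ignored.
    """
    return [
        clip for clip in clips
        if all(getattr(clip, field) == value for field, value in criteria.items())
    ]
```

With records loaded into `ClipMetadata` objects, a call such as `select_clips(clips, pace="fast", sample_rate_hz=16000)` would return only the fast-paced clips recorded at 16 kHz.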
            
Use Cases & Applications
•Voice Assistant Activation: Train models to accurately detect and trigger based on wake words
•Smart Home Devices: Enable responsive voice control in smart appliances
•Automotive Voice Control: Power voice-based commands for navigation, entertainment, and system control
•Wearables: Enhance hands-free operation with precise wake word recognition
•Consumer Electronics: Improve voice interactivity across TVs, IoT devices, and more
•Generative AI Integration: Use wake words to trigger context-aware conversational AI systems
Data Security & Ethics
•Collected via FutureBeeAI’s proprietary Yugo platform
•Maintained in a secure and confidential environment
•Full participant consent ensured; no personally identifiable information included
•Compliant with ethical data collection standards
Customization Options
We offer continuous updates and flexible customization to suit your project needs:
•Environmental Customization: Recordings in specific background conditions
•Sampling Rate Options: Custom data at 8 kHz, 16 kHz, 44.1 kHz, or 48 kHz
•Pace Adjustments: Slow, normal, or fast speech
•Device-Specific Recording: Capture using specific brands or operating systems
•Custom Wake Words/Commands: Record your custom prompts using our community network
License
This dataset is developed by FutureBeeAI and is available for commercial use.