
Voice Activity Detection

by Synthiam

Real-time VAD using 300-3400 Hz FFT to trigger scripts on speech start/stop, with sensitivity tuning, live audio level graph, and TTS pause.

Requires ARC v8 (Updated 3/31/2026)
Compatible with: Microsoft Windows 10 or 11

How to add the Voice Activity Detection robot skill

  1. Load the most recent release of ARC (Get ARC).
  2. Press the Project tab from the top menu bar in ARC.
  3. Press Add Robot Skill from the button ribbon bar in ARC.
  4. Choose the Audio category tab.
  5. Press the Voice Activity Detection icon to add the robot skill to your project.

Don't have a robot yet?

Follow the Getting Started Guide to build a robot and use the Voice Activity Detection robot skill.


How to use the Voice Activity Detection robot skill

The Voice Activity Detection (VAD) skill listens to your microphone in real time and fires scripts when speech begins or ends. It uses FFT (Fast Fourier Transform) analysis on the 300-3400 Hz speech band to distinguish voice from background noise, filtering out short impulsive sounds like knocks and clicks.

Typical use: trigger a speech recognition skill when someone starts talking, and stop it when they go silent.

Key Features:

  • Speech Detection:
    • Detects when speech starts (Speech Begin) and stops (Speech End).
  • Customizable Actions:
    • Allows users to attach custom scripts that execute automatically when speech starts or stops. For example, you can trigger robot movements, lights, or other interactions based on speech activity.
  • Real-Time Audio Visualization:
    • Displays a live graph of the detected speech level, giving a visual representation of the audio activity.
  • Adjustable Sensitivity:
    • Includes settings to fine-tune detection parameters, such as silence thresholds, for optimal performance in various environments.

Practical Applications:

  • Interactive Robots:
    • Enable your robot to react to speech dynamically, such as greeting people when they start talking or pausing when they stop.
  • Hands-Free Control:
    • Use speech detection to trigger actions without needing additional input devices.
  • Speech Analysis:
    • Visualize audio levels for debugging or fine-tuning robot behavior in different environments.

This robot skill combines robust speech detection with seamless integration into Synthiam ARC, empowering your robot to better understand and respond to its surroundings!

Main Window

Voice Detected: The graph displays green while a voice is detected, and the voice start script is executed.


Voice Not Detected: The graph displays red when a human voice is absent, and the voice end script is executed.


Pause: You can pause speech detection by checking the Pause checkbox on the main form. Pause is also enabled automatically whenever an Audio.say() scripting command runs, and is cleared when speaking completes. This ensures the VAD does not detect the robot's own speech and trigger a false positive.


Graph


A real-time level meter shows the detected audio energy relative to the sensitivity threshold:

Color Meaning
Red Listening - no speech currently detected
Green Speech is active
Yellow Detection is paused

The graph bar fills from 0-100% as energy approaches and exceeds the threshold.

Pause Checkbox

Checking Pause suspends all voice detection. The graph stops updating and no scripts will fire. Uncheck to resume.

The plugin automatically pauses while the robot's text-to-speech (TTS) is speaking, then restores your previous pause state when TTS finishes. This prevents the robot's own voice from triggering detection.
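The save-and-restore behavior can be modeled as a small state latch: a forced pause while TTS is active, layered on top of whatever the user chose. This is an illustrative sketch in Python, not the skill's actual source; the class and member names are invented:

```python
class PauseLatch:
    """Illustrative model of VAD pause handling during TTS (names invented).

    Remembers whether the user manually paused detection, forces a pause
    while TTS is speaking, then restores the user's choice afterwards.
    """

    def __init__(self):
        self.user_paused = False   # state of the Pause checkbox
        self.tts_active = False    # robot currently speaking

    @property
    def detection_enabled(self):
        # Detection runs only when neither the user nor TTS holds a pause
        return not (self.user_paused or self.tts_active)

    def on_tts_start(self):
        self.tts_active = True     # ignore the robot's own voice

    def on_tts_end(self):
        self.tts_active = False    # fall back to the user's setting


latch = PauseLatch()
latch.on_tts_start()
assert not latch.detection_enabled   # paused while the robot speaks
latch.on_tts_end()
assert latch.detection_enabled       # user had not paused, so detection resumes
```

Because the user's pause flag is stored separately from the TTS flag, a manual pause survives a TTS announcement rather than being silently cleared.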

Log Tab

Displays diagnostic output. Every ~50 audio frames (~2.5 seconds) a line is written:

[VAD] energy=0.000042  threshold=0.000050  ratio=0.840

Use these values to calibrate Sensitivity: the energy measured while you speak should sit just above the threshold, and the energy during silence just below it.
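As a rough aid when reading these log lines, the ratio column is simply energy divided by threshold. A tiny helper (hypothetical, not part of the skill) illustrates:

```python
def vad_ratio(energy: float, threshold: float) -> float:
    """Ratio of smoothed band energy to the Sensitivity threshold.

    Values >= 1.0 mean the current frame counts as speech; values that
    hover just under 1.0 while you are talking suggest lowering Sensitivity.
    """
    return energy / threshold


# Matches the sample log line above:
print(round(vad_ratio(0.000042, 0.000050), 3))  # 0.84
```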


Configuration


Open configuration by clicking the gear/config button on the skill panel.

Sensitivity

Default: 0.00005

The FFT energy threshold for speech detection. The VAD compares the smoothed energy of the 300-3400 Hz band against this value.

  • Lower value = more sensitive (triggers on quieter sounds)
  • Higher value = less sensitive (requires louder speech to trigger)

Use the diagnostic log to tune this. If energy is consistently below threshold when speaking, lower the value. If background noise is triggering false detections, raise it.

Silence Duration

Default: 500 ms

How long (in milliseconds) silence must persist after speech before the Speech End event fires. Increase this if the script fires too early during natural pauses in speech. Decrease it for faster response after someone stops talking.

Minimum Speech Duration

Default: 150 ms

How long (in milliseconds) energy must continuously exceed the threshold before Speech Begin fires. This filters out short impulsive sounds (clicks, knocks, coughs) that are not sustained speech. Increase this to be more aggressive about ignoring non-speech sounds.
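Minimum Speech Duration and Silence Duration together form a simple hysteresis around the Sensitivity threshold. A minimal Python sketch of that logic (an illustration only, assuming the 50 ms frame interval described under How Detection Works; the function and parameter names are invented):

```python
def vad_events(frames, threshold, frame_ms=50,
               min_speech_ms=150, silence_ms=500):
    """Yield ('begin', i) / ('end', i) events from smoothed energy frames.

    frames: iterable of smoothed band-energy values, one per 50 ms chunk.
    Energy must stay above threshold for min_speech_ms before 'begin';
    it must stay below for silence_ms after speech before 'end'.
    """
    speaking = False
    above_ms = 0   # how long energy has been continuously above threshold
    below_ms = 0   # how long energy has been continuously below threshold
    for i, energy in enumerate(frames):
        if energy > threshold:
            above_ms += frame_ms
            below_ms = 0
            if not speaking and above_ms >= min_speech_ms:
                speaking = True
                yield ("begin", i)
        else:
            below_ms += frame_ms
            above_ms = 0
            if speaking and below_ms >= silence_ms:
                speaking = False
                yield ("end", i)


# 5 loud frames (250 ms), then 12 quiet frames (600 ms):
events = list(vad_events([1.0] * 5 + [0.0] * 12, threshold=0.5))
print(events)  # [('begin', 2), ('end', 14)]
```

Note how the begin event fires only after three loud frames (150 ms) and the end event only after ten quiet frames (500 ms), which is exactly why brief clicks and mid-sentence pauses do not toggle the state.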


Scripts

Both scripts are configured inside the Config dialog.

Speech Start Script

Runs once when speech begins (after energy exceeds Sensitivity for at least Minimum Speech Duration milliseconds).

Typical use:

// Start a speech recognition skill
ControlCommand("Speech Recognition", "Start Listening");

Speech End Script

Runs once when silence has persisted for the Silence Duration after speech was active.

Typical use:

// Stop the speech recognition skill
ControlCommand("Speech Recognition", "Stop Listening");

How Detection Works

  1. Microphone audio is captured at 16 kHz, 16-bit mono, in 50 ms chunks.
  2. Each chunk is optionally filtered with a high-pass BiQuad filter (cutoff 300 Hz) to remove low-frequency rumble.
  3. Samples accumulate into a 1024-sample FFT buffer. When full, a Hamming window is applied and the FFT is computed.
  4. The average magnitude of all FFT bins in the 300-3400 Hz speech band is calculated.
  5. That energy value is exponentially smoothed (factor 0.3) to reduce frame-to-frame jitter.
  6. The smoothed energy is compared to Sensitivity:
    • If it stays above the threshold for Minimum Speech Duration, Speech Begin fires.
    • If it stays below the threshold for Silence Duration after speech was active, Speech End fires.
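Steps 3-5 above can be sketched numerically. The following is an illustrative pure-Python model, not the skill's actual source: for clarity it evaluates the DFT directly for just the speech-band bins, where a real implementation would compute the full 1024-point FFT.

```python
import math

SAMPLE_RATE = 16000
FFT_SIZE = 1024
BAND_LOW_HZ, BAND_HIGH_HZ = 300, 3400
SMOOTHING = 0.3  # exponential smoothing factor


def band_energy(samples):
    """Average spectral magnitude in the 300-3400 Hz band.

    samples: FFT_SIZE floats in [-1, 1]. A Hamming window is applied,
    then the magnitude of each bin in the speech band is averaged.
    """
    # Hamming window reduces spectral leakage between bins
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (FFT_SIZE - 1)))
                for n, s in enumerate(samples)]
    bin_hz = SAMPLE_RATE / FFT_SIZE          # 15.625 Hz per bin
    lo = int(BAND_LOW_HZ / bin_hz)           # first speech-band bin
    hi = int(BAND_HIGH_HZ / bin_hz)          # last speech-band bin
    total = 0.0
    for k in range(lo, hi + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / FFT_SIZE)
                 for n, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * n / FFT_SIZE)
                  for n, x in enumerate(windowed))
        total += math.sqrt(re * re + im * im)
    return total / (hi - lo + 1)


def smooth(previous, current, alpha=SMOOTHING):
    """Exponential smoothing: new = alpha * current + (1 - alpha) * previous."""
    return alpha * current + (1 - alpha) * previous


# A 1 kHz tone (inside the band) produces substantial band energy,
# while a silent frame produces none:
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(FFT_SIZE)]
assert band_energy(tone) > 0.0
assert band_energy([0.0] * FFT_SIZE) == 0.0
```

The smoothed value from `smooth()` is what gets compared against Sensitivity, so a single noisy frame moves the detector only 30% of the way toward its energy.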

Tuning Guide

Robot is not triggering on speech

  1. Check the diagnostic log. Note the energy value while speaking.
  2. Lower Sensitivity until it is slightly below the energy value when speaking.
  3. If still not triggering, ensure the correct microphone is selected in Windows Sound settings.

Too many false triggers (background noise)

  1. Note the energy value during silence in the diagnostic log.
  2. Raise Sensitivity to just above that noise floor value.
  3. Increase Minimum Speech Duration (e.g., 200-300 ms) to require more sustained sound.

Speech End fires too quickly (mid-sentence pauses)

Increase Silence Duration (e.g., 800-1500 ms).

Speech End fires too slowly after speaking stops

Decrease Silence Duration (e.g., 300 ms).

Robot triggers on its own voice

This is handled automatically - detection pauses when TTS is active. If issues persist, verify that the TTS events are wired correctly (the plugin uses EZBManager.PrimaryEZB.SpeechSynth).


Technical Specifications

Property Value
Sample rate 16,000 Hz
Bit depth 16-bit mono
Buffer interval 50 ms
FFT size 1024 samples
Speech band 300-3400 Hz
Window function Hamming
Pre-filter High-pass at 300 Hz, Q=0.707
Energy smoothing Exponential, α = 0.3

Control Commands for the Voice Activity Detection robot skill

Control Commands are available for this robot skill, allowing it to be controlled programmatically from scripts or other robot skills. These commands let you automate actions, respond to sensor inputs, and integrate the robot skill with other systems or custom interfaces. If you're new to Control Commands, a comprehensive manual is available that explains how to use them and provides examples to get you started.

Control Command Manual

Control Commands

Other skills or scripts can control VAD behavior using ControlCommand().

Command Effect
Pause Suspends detection (same as checking the Pause checkbox)
Resume Re-enables detection (same as unchecking Pause)

Example from another skill's script:

// Pause VAD while playing audio
ControlCommand("Voice Activity Detection", "Pause");

// ... do something ...

// Re-enable VAD
ControlCommand("Voice Activity Detection", "Resume");


#1  

This one seems interesting. I wonder if my TV or radio would set it off all the time

#2  

I can see great uses for this, particularly in interactive robots.  When it hears a voice it could turn on face detection and start looking around until it sees a face.  (would be super cool if it could support multiple microphones and compare the levels so it knows what direction to start looking.  might not even need face detection for that, but I don't think Windows even deals well with having multiple microphones on at the same time, so probably beyond the scope of this project...).

Alan

Synthiam
#3  

You could combine this skill with the Kinect 369 depth skill. It returns a variable with the angle of audio.

Synthiam
#4  

I should add the reason our client asked for this skill is to have fluent conversational dialog with their robots. Not sure what speech recognition they’re using. But I do know they’re using google dialog flow for nlp

#5  

Hmm.   Too bad your results with the Kinect for navigation haven't been as promising as Realsense.  I don't think I need the ability badly enough for the expense of both (or a Kinect and a Lidar).  I'll keep it in mind though.  I think I recall seeing something on one of the robot parts sites about a sound direction finder.  If I come across it again I'll see whether it is something that is inexpensive and a skill could be written for.

Synthiam
#6  

Fixed a bug with an error when closing the skill

#7  

I would love to see this have an added feature sometime that allows it to check output audio instead of input audio. For talking robots, the ServoTalk skills will estimate how long a jaw should move based on a text string's length and content. They tend not to be very accurate and often underestimate or overestimate the time it takes for the text-to-speech app to run. I thought this skill could help circumvent this limitation by keeping an ear open for when the audio starts and ends. I also thought of adding some natural neck animation while my robot speaks. The problem is that I have no way to trigger my 'stay alive' script when the audio begins and stop it when text-to-speech ends. This skill almost got me there, then I realized it worked only with the mic input. A check box that switches what it listens to from MIC to LINE OUT would be super cool!

#8  

v7 has been updated with some additional fine-tuning for more accurate detection and completion of speech.