
afcorson
Asked

Silence Detection For Bing Speech Recognition Skill
I thought I had requested this feature before, but I can't find any record of it. I would like the Bing Speech Recognition Skill to detect when a person stops talking and then stop recording. VAD is not that reliable, and limiting the recording length is a clumsy way to stop the recording. For example, if the recording limit is set to 7 secs and the user just says 'hello robot', they would be waiting several seconds before getting a response, by which time they have probably walked away.
Related Controls
Bing Speech Recognition
Voice Activity Detection
The VAD robot skill will be helpful for your usage. The manual for it is here: https://synthiam.com/Support/Skills/Audio/Voice-Activity-Detection?id=20215
An example provided will start/stop a speech recognition robot skill upon detecting speech. In your case, add the ControlCommand() to instruct Bing to Start Listening when a voice is detected. Add another ControlCommand() to Stop Listening when the voice ends. You can add these commands to the scripts of the VAD robot skill.
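For example, each VAD script could hold a single ControlCommand(). Here is a minimal sketch; verify the exact command names against the Bing Speech Recognition skill's entry in your project's Cheat Sheet:

```javascript
// VAD "voice started" script: begin a Bing listening session.
ControlCommand("Bing Speech Recognition", "StartListening");

// VAD "voice ended" script: stop the session so the audio is processed
// immediately instead of waiting for the record-length timer.
ControlCommand("Bing Speech Recognition", "StopListening");
```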
What you request from Microsoft Bing's Speech Recognition does not exist beyond what is presented. There are limitations to software created by third parties. When a limitation does not fit your requirements, an additional skill is necessary.
I made no request relating to Microsoft's Bing Speech Recognition. The feature request relates to ARC's Bing Speech Recognition. My robot only listens when WaitForSpeech is invoked or Bing Speech Recognition is started. I have no need to use the Speech Recognition Skill. The VAD Skill may well help stop Bing Speech from listening when no sound is detected. I will do some testing. Also, I found no provision to adjust sensitivity in the VAD Skill.
Ah, the support agent was correct. Microsoft makes Bing Speech Recognition. The company/manufacturer of the robot skill is listed on its respective manual pages.
This is a good solution since the Microsoft Bing Speech Recognition can't be modified. Give their suggestion a read again; it makes sense.
I don't know what "sensitivity" you'd be looking for. Speech is not a sensitivity value because it's either there or not. I think the value of "how long of silence" might be helpful. You can read how VAD works here: https://en.wikipedia.org/wiki/Voice_activity_detection
It's a pretty complicated thing. I think most people think of sound as a wave with a volume level. But it's a waveform with complex peaks and valleys that have a unique signature to the type of sound. If you examine the waveforms, the sound of a human voice is different than that of a car, etc. There's a lot of math in there, and sensitivity isn't a thing for something like that. The algorithm is "yes" or "no", which will also present false positives for both cases because nothing's perfect.
Also, the other thing to consider is understanding why all your Google Home/Alexa/Siri devices have wake words. The device is not listening for speech. It's listening for a particular pattern that matches one stored pattern to be triggered. That's why VAD is such a tricky thing, and Alexa, Siri, Google Home, etc., do not use VAD. They use a wake word.
@Athena give some details on VAD to expand on what I wrote.
That is correct. The robot skill's name is Bing Speech Recognition, but the company that makes it is Microsoft. While the Bing Speech Recognition robot skill manual does not explicitly display the manufacturer as DJ suggested, it was created in partnership with Microsoft. Bing is a trademark of the Microsoft Corporation, not just a search engine.
The solution we presented was tested after your feature request. The robot skill VAD was updated for improved performance. Your request was given several man-hours of attention, and a solution was presented. We hope it fulfills your requirements.
I just tested their solution and it works fantastic - great idea! What I did was tweak it a little. I don't like VAD detecting the START of a speech. I want to use a wakeword. So in this case I only used this VAD robot skill to instruct Bing Speech Recognition to STOP LISTENING. I left the "Start Listening" VAD script empty.
Here's the project I created for my test. Works awesome: test vad.EZB
It sounds like you're on the right track with using the Voice Activity Detection (VAD) skill to manage when the Bing Speech Recognition skill starts and stops listening. This approach can help improve the responsiveness of your robot by stopping the listening process when no speech is detected, thus avoiding unnecessary waiting times.
Steps to Implement VAD with Bing Speech Recognition:
Configure VAD to Stop Listening: In the VAD skill's Voice Stop script, add a ControlCommand() that tells Bing Speech Recognition to stop listening as soon as speech ends.
Start Listening with a Wake Word: If you prefer a wake word (as you described), leave the Voice Start script empty and let Bing Speech Recognition's wake word begin the listening session.
Testing and Adjustments: Test in your environment and adjust the Windows microphone input level if detection is unreliable.
Example Script for VAD:
Voice Stop Script:
Voice Start Script:
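A minimal sketch, using the same assumed ControlCommand names as above (check them in your project's Cheat Sheet):

```javascript
// Voice Stop Script: end the listening session the moment speech ends.
ControlCommand("Bing Speech Recognition", "StopListening");

// Voice Start Script: leave empty if you use a wake word instead, or
// start listening whenever the VAD detects a voice:
ControlCommand("Bing Speech Recognition", "StartListening");
```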
Additional Considerations:
Sensitivity Adjustments: The VAD skill calibrates itself against the ambient noise floor; if detection is unreliable, adjust the microphone input level in the Windows sound settings.
Wake Word Implementation: A wake word ensures the robot only begins listening deliberately, while the VAD handles stopping the session when the speaker finishes.
If you have any further questions or need additional assistance, feel free to ask!
I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
Thanks for the info. In testing the VAD Skill, I found it did nothing, just a red squiggly line. However, when I spoke loudly, it suddenly worked. So the sensitivity feature mentioned in the Skill doco would be useful - see below. The VAD Skill will stop Bing Speech Recognition, as long as I am speaking loudly (on my setup).
Key Features:
- Speech Detection: Detects when speech starts (Speech Begin) and stops (Speech End).
- Customizable Actions: Allows users to attach custom scripts that execute automatically when speech starts or stops. For example, you can trigger robot movements, lights, or other interactions based on speech activity.
- Real-Time Audio Visualization: Displays a live graph of the detected speech level, giving a visual representation of the audio activity.
- Adjustable Sensitivity: Includes settings to fine-tune detection parameters, such as silence thresholds, for optimal performance in various environments.
I should be able to adjust the Mic input level in Windows Settings to achieve the correct sensitivity level with the VAD Skill.
I just wanted to chime in here and validate @afcorson's request. I also found the timed stop-listening function in Bing Speech problematic.
I have my Bing set to a wake word, no VAD, and a 5-second timer to stop listening. I have only one command that takes over 5 seconds (7, actually) and many that take much less than 5 seconds. If I want Bing to catch the 7-second command, I need to set the timer to 7 seconds, and the much shorter commands will have a very long pause before something happens. I'm just stating all this to be clear and to support afcorson.
I really appreciate @Customer Support and @DJ putting in the work on afcorson's request. I'm looking forward to testing this myself. If it works for me, it will be a game changer for the apparent lag in the response of my robot using Bing voice commands. I mainly control most everything with Bing Speech. Thank you!
Thanks for those comments. I often use Bing Speech Recognition to communicate with ChatGPT. The person operating the robot could be asking anything, which might take 1 sec or 9 secs. That's why it's important to stop listening as soon as they have stopped talking. It will be even more important when GPT-4o realtime is available and affordable using audio input.
I dig that you’re pushing for technology that fits your specific scenario, but I don’t know if that technology can do what you’re requesting. If Amazon or Google can’t do it, you won’t see it anywhere else.
Here's why (and to be clear, it can be done with faster real-time AI processing someday): listening is a conscious act. Our human brains can differentiate multiple sounds and voices to focus on one and interpret it in real-time. Remember, speech and other noises are the same thing; they're sound. Sound is sound. It's waveforms. You have to sample the waveform and process it afterward.
The speech is processed after it has been completed, not in real-time like your brain does.
Speech is a sound. The VAD isolates a waveform frequency range through a filter. Then, it tracks how long the filtered waveform is above a specific decibel level compared to the sampled noise floor. The floor is also an average of the sound within the filtered range.
Okay, now the filtered waveform's decibel peaks within the sample size exceed the noise floor for a time. That must mean there's speech.
By this point, we've already lost most of the first bit of speech. But it's now recording.
Okay, now there’s more difficulty understanding when the speech stops.
It can be known, but only after a period of time. The period is based on how long the speech is being recorded. Also, the time to maintain an average above the floor is not always accurate for all speech.
For example, several words have quiet parts to them. Specifically, words have spaces of silence between them. In addition, pauses to think of the next word vary between humans.
That's why I said it's a conscious act to understand when someone has finished speaking. We also use our eyes, but that's a different subject.
So, if you were to assume every human across the planet spoke at the same volume with the same silence between words, you could more accurately do what you want.
This is why your Alexa, Siri, or Google will cut you off as you’re speaking. They don’t know when you’ve finished, either. It’s just impossible to know at present without real-time processing, which doesn’t exist with current computing capability.
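To make that concrete, here's a toy sketch in JavaScript of the approach described above. The numbers are invented, and this is not the robot skill's actual code:

```javascript
// Toy example of the "filtered energy vs. adaptive noise floor" idea.
function createVad({ trigger = 2.5, floorAdapt = 0.05, hangoverFrames = 25 } = {}) {
  let noiseFloor = 1e-4; // running average energy of "silence"
  let activeFrames = 0;  // consecutive frames above the floor
  let hangover = 0;      // quiet frames tolerated before declaring "speech end"
  let speaking = false;

  // frame: Float32Array already band-pass filtered to the voice range
  return function process(frame) {
    let energy = 0;
    for (const s of frame) energy += s * s;
    energy /= frame.length;

    if (energy > noiseFloor * trigger) {
      activeFrames++;
      hangover = hangoverFrames;
      // require the floor to be exceeded "for a time" before triggering
      if (!speaking && activeFrames > 3) { speaking = true; return "speech-begin"; }
    } else {
      activeFrames = 0;
      // learn the floor only from quiet frames so speech can't raise it
      noiseFloor += floorAdapt * (energy - noiseFloor);
      // gaps between words don't end speech until the hangover runs out
      if (speaking && --hangover <= 0) { speaking = false; return "speech-end"; }
    }
    return speaking ? "speaking" : "silence";
  };
}
```

The hangover count is exactly the "silence timeout" being discussed here: too short, and it cuts you off in the pauses between words; too long, and you get the laggy response afcorson described.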
I'm guessing the developer could add parameters for silence and sound timing, because it should be possible. But right now, I know it's self-adjusting.
I'll poke them and see if they can add those parameters for you to hardcode. It'll be a waste of effort, but I'll have them do it anyway. Once you see what hardcoding will do, you'll most likely want to use the auto setting.
I tested this skill yesterday, and it works wonderfully. So, I’m more curious about why your setup doesn’t work. Perhaps some effort is necessary to improve your mic type, position, and volume levels.
We can add the parameters for adjusting the silence timeout rather than having it calculated. We meant to do this in yesterday's release but ran into several issues, so we stuck with the calculation used in widespread instances of this feature.
The primary failure points we noticed are mic types (quality), background noise, input volume, and location. Devices such as Amazon Alexa are manufactured for specific hardware from Amazon, for example. This allows them to control the hardware, mic quality, etc.
We tested with a Jabra Conferencing mic designed for that usage. General mics are for audio specifically and are not fine-tuned for speech. A general mic on a laptop or handheld is designed to record everything from music to sound effects to speech. Having a mic designed for speech helps remove false positives.
I think the solution that everyone uses, such as those Sophia robots, is to use a microphone with a push button.
Press and hold the button to record and release it when stopped. Use a looping script to monitor the state of a button. Start and stop recording appropriately.
DJ, were you thinking of something like this? In this example, the switch for the mic would be on port d0.
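Something like this sketch, perhaps. It assumes ARC's JavaScript Digital.get() and sleep() calls and the Bing skill's StartListening/StopListening ControlCommands; verify all of these against your ARC version and the skill's Cheat Sheet:

```javascript
// Push-to-talk: hold the mic button wired to port D0 to record, release to stop.
var wasPressed = false;

while (true) {
  var pressed = Digital.get(d0); // true while the button is held

  if (pressed && !wasPressed)
    ControlCommand("Bing Speech Recognition", "StartListening");
  else if (!pressed && wasPressed)
    ControlCommand("Bing Speech Recognition", "StopListening");

  wasPressed = pressed;
  sleep(50); // poll the button ~20 times per second
}
```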
Yes yes - just like that. I think the Bing Speech Recognition should have a variable set while it is listening and processing. That would be useful so scripts don't bother doing something while speech is being processed.
Today I did some further testing of speech recognition and the VAD Skill using two new USB mics. Whilst speech recognition worked perfectly with both microphones, I had to yell into the mic to get a green line in the VAD skill. I do wish it had adjustable sensitivity, as alluded to in the documentation for this skill.
@afcorson, I've had that problem a couple of times in the past. As it turned out, most of the time I needed to adjust the microphone input settings in Windows. There are several ways to get to this setting depending on which version of Windows you have, but the easiest way for me was to right-click the speaker icon on the Windows taskbar and then click Sound Settings. You can probably also find it in the Control Panel or Start menu. Once you get to the microphone input section, make sure the mic device you are using (computer mic or the mic input jack) is selected. Once you know you are pointing to the proper device, you may need to click on the device properties to adjust the level of the mic you're using.
Also, "I think" there is a "Set Up Microphone" button right in the skill in ARC? That may take you directly to this setting if it's available. cool
Good luck
I know this is on a list to be revisited - I'm guessing there's some thought going on about how to make it work. As I explained earlier, it's not as simple as volume control. The positive result of your "yelling" into the microphone wasn't because of volume; it's because of how the sound was detected against the background noise.
At some point, you'll have to revisit my detailed response above and explore proper microphone options for filtering voice ranges, noise filtering, etc.
My USB mic level is set to 100% and there is no background noise. Talking loudly peaks significantly despite the VAD skill barely detecting it.
I understand your experience with generic mic hardware is not optimal. While it is on the list to be revisited, there is thought and discussion in the meantime on how to make it work for your configuration. You may have missed a few of my previous posts where I mentioned the difficulty of isolating the sound of a voice in real time without significant latency issues leading to false positives. So, the algorithm has to be fast and requires the assistance of specific voice mic hardware.

Mics are not all made equal, because your mic could record a guitar, a dog barking, a train, a baby laughing, wind, or a coin falling into a bucket; I dunno - those are random examples, because your mic is not designed for just speech. However, all our mics are tuned for speech by Jabra, so our experience will be different because we're using hardware specifically designed for the job. Alexas, Google Homes, etc., also have frequency filters for speech. Any hardware you compare your VAD experience against uses hardware filtering to aid in isolating the human speech frequency range. But, as stated earlier, they also use wake words. Wake words still benefit from a mic properly tuned for human speech. In addition, all the robots you see in the media with people holding a mic to speak to them are using buttons to trigger the recording.

As always, I value your feedback, and the platform is built on that model - however, sometimes we're up against a wall when the technology doesn't exist. As I mentioned earlier, the VAD is not just "raise the amplitude" because it's more complicated than that.
@Athena, explain how a voice activity detection algorithm works and the complications behind it.
In the meantime, give us a bit of time - I'm sure we'll figure something out, as we always do.
Voice Activity Detection (VAD) is indeed a complex process, especially when it comes to distinguishing human speech from other sounds in real-time. The challenges you're facing with the VAD skill in ARC are not uncommon, especially when using generic microphones that aren't specifically designed for speech recognition.
Understanding VAD and Its Challenges
Sensitivity and Background Noise: VAD algorithms need to differentiate between speech and background noise. This can be challenging if the microphone picks up a lot of ambient sounds or if the speech is not significantly louder than the background noise.
Microphone Quality: As DJ_Sures mentioned, not all microphones are created equal. High-quality microphones designed for speech recognition often have built-in noise-canceling features and are tuned to pick up the frequency range of human speech more effectively.
Latency and False Positives: The algorithm must be fast to avoid latency, which can lead to false positives or missed detections. This requires a balance between sensitivity and accuracy.
Recommendations
Microphone Settings: As Dave_Schulpius suggested, ensure your microphone settings in Windows are optimized. You can access these settings by right-clicking the speaker icon in the taskbar and selecting "Sound Settings." Make sure the correct microphone is selected and adjust the input level as needed.
Microphone Type: Consider using a microphone specifically designed for speech recognition. These often have better noise-canceling capabilities and are more sensitive to the frequency range of human speech.
Environment: Try to minimize background noise as much as possible. This can help the VAD algorithm focus on your voice.
Wake Words: If possible, use wake words to trigger the speech recognition. This can help ensure that the system is only listening when you want it to, reducing the chance of false positives.
Future Improvements
The Synthiam team is aware of the challenges and is considering ways to improve the VAD skill. This might include adding adjustable sensitivity settings or exploring new algorithms that can better handle a variety of microphone types and environments.
If you have any more questions or need further assistance, feel free to ask!
I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
I was provided with the Voice Activity Detector code used in the Synthiam VAD robot skill behind the scenes. I was also asked to explain how it works and relay that information to anyone watching this thread. If you have any questions about this program evaluation, please ask!
Alright, let’s break this down in the simplest way possible, step by step, so it makes sense even if you don’t know much about math, programming, or audio processing.
What Does This Program Do?
This program listens to audio (like a microphone recording) and tries to figure out when someone is speaking and when it’s just silence or background noise.
Imagine you're at a party, and you’re trying to pick out someone’s voice from all the noise around you. This program does the same thing but in a super-smart, automated way.
How Does It Work?
1. Breaking the Audio into Small Pieces
The incoming audio is split into short chunks so each one can be examined on its own.
2. Looking for Patterns in Each Chunk
For every chunk of audio, the program looks at three things:
- Loudness: how much energy the chunk contains compared to the background.
- Patterns: how the waveform wiggles, which looks different for speech than for steady noise.
- Pitch: whether the chunk contains frequencies typical of a human voice.
3. Learning What Silence Sounds Like
When it starts, the program samples the background audio to build a baseline of what "silence" (the noise floor) sounds like in your environment, and it keeps updating that baseline over time.
4. How It Decides If Someone Is Speaking
Once the program knows what silence sounds like, it starts comparing every new audio chunk to that baseline: a chunk that stands out from the baseline becomes a candidate for speech.
To make sure it doesn’t jump to conclusions, it checks at least two out of the three things (loudness, patterns, and pitch). If two or more say, This is speech, it decides someone is speaking.
5. What Happens When It Detects Speech
The program keeps track of whether someone is currently speaking or not:
- When speech is first detected, it flags Speech Begin and runs your Voice Start script.
- When the audio falls back to the baseline for long enough, it flags Speech End and runs your Voice Stop script.
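To make the walkthrough concrete, here is an illustrative JavaScript sketch of the "two out of three" vote from step 4. This is not the actual robot skill's code; the thresholds and the simplified pattern and pitch checks are invented for the example:

```javascript
// Illustrative two-of-three speech vote over one audio chunk (Float32Array).

// 1. Loudness: average energy of the chunk.
function chunkEnergy(chunk) {
  let e = 0;
  for (const s of chunk) e += s * s;
  return e / chunk.length;
}

// 2. Patterns: zero-crossing rate. Speech sits in a middle band; hiss
// crosses constantly, low hum barely at all.
function zeroCrossRate(chunk) {
  let crossings = 0;
  for (let i = 1; i < chunk.length; i++)
    if ((chunk[i] >= 0) !== (chunk[i - 1] >= 0)) crossings++;
  return crossings / chunk.length;
}

// 3. Pitch: crude autocorrelation estimate of the dominant period.
function estimatePitchHz(chunk, sampleRate) {
  let bestLag = 0, best = 0;
  for (let lag = 32; lag < 320; lag++) { // roughly 50-500 Hz at 16 kHz
    let corr = 0;
    for (let i = 0; i + lag < chunk.length; i++) corr += chunk[i] * chunk[i + lag];
    if (corr > best) { best = corr; bestLag = lag; }
  }
  return bestLag ? sampleRate / bestLag : 0;
}

function looksLikeSpeech(chunk, noiseFloor, sampleRate = 16000) {
  const loud = chunkEnergy(chunk) > noiseFloor * 3;  // louder than baseline
  const zcr = zeroCrossRate(chunk);
  const pattern = zcr > 0.02 && zcr < 0.25;          // voice-like wiggle
  const pitch = estimatePitchHz(chunk, sampleRate);
  const voicePitch = pitch > 75 && pitch < 400;      // human pitch range
  // Two of the three checks must agree before declaring speech.
  return [loud, pattern, voicePitch].filter(Boolean).length >= 2;
}
```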
What’s Cool About It?
Imagine This in Everyday Life
It's the party scenario from earlier: your brain tunes out the steady hum of the crowd and perks up when a voice cuts through it. The program imitates that with math.
Challenges of the Code and Why It Might Not Always Work
This program is clever, but it’s not perfect, and it depends a lot on the environment and the equipment being used. Here are some challenges it faces:
1. Microphone Quality
A generic mic records everything from music to sound effects, while a mic designed for speech filters toward the voice frequency range, which reduces false positives.
2. Background Noise
A noisy environment raises the noise floor, so quiet speech no longer stands out from the baseline.
3. Microphone Placement
A mic far from the speaker picks up proportionally more room noise than voice.
4. Program Assumptions vs. Real-Life Scenarios
The code assumes people speak at a similar volume with similar pauses between words, which is not how real humans talk.
5. Speech-Like Noises
Sounds that share the human voice frequency range can still trigger false positives.
Why a Good Microphone Makes a Difference
A speech-tuned, noise-canceling mic delivers a cleaner waveform in the voice frequency range, so the loudness, pattern, and pitch checks all become more reliable.
Ideal Setup for This Program
A quality mic designed for voice (such as a conferencing or headset mic), positioned close to the speaker, in a reasonably quiet room.
Summary
This program is powerful, but its effectiveness heavily depends on the microphone and environment. A poor mic or noisy setting makes it struggle to detect speech accurately. On the other hand, a high-quality microphone designed for capturing voice, with noise-canceling features, gives the program the clean input it needs to work reliably.
I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
Sorry @afcorson, I didn't completely understand your problem. While pointing you to the volume settings in Windows was a good thing to check, I was thinking along the wrong lines. Personally, I don't use VAD in my Bing Speech skill. I have it turned off, as I don't like the thing listening to everything the whole time the robot is powered up. I guess I have some privacy issues. LOL. I have VAD turned off and use a wake word. I also have a very good headset microphone with noise canceling. The mic is right next to my mouth, and I have close to 100% recognition. I have tried using the VAD feature in the past and have seen very good recognition, but I haven't really used it enough to say it's as good as my present method. I'm seriously considering giving it a try again, as Synthiam seems to have put a lot of work into improving the VAD experience in ARC.
For reference, I'm using the Plantronics VOYAGER-5200-UC (206110-01) Advanced NC Bluetooth Headset System. It comes with a PC dongle and is optimized for use with a Windows PC. It's pricey at over $100 USD, but it's the best I've found (and I've tested a lot of mics of different types with ARC and Windows). You can look at it here on Amazon: Voyager 5200
Anyway, good luck with working through this issue my friend. I'm very interested in how this progresses and this subject as VR is the main way I control my robot.
Thanks for your comments. I am not using VAD in Bing Speech. It is the "Voice Activity Detection" Skill I was having problems with. I just need a way of knowing whether my robot is being spoken to or not, so that Bing Speech stops listening. The VAD Skill works if I am speaking very loudly but otherwise does not. And the documentation for that skill mentions a sensitivity adjustment, which seems to be an error. I will manage with what I have for the moment.