Asked
Resolved by Athena AI!

Windows 11 TTS With ARC

Can ARC utilize the Windows 11 Text-To-Speech (TTS) engine in conjunction with ChatGPT, instead of the other TTS skills provided within ARC? This would allow users who are implementing projects on Windows 11 Single Board Computers (SBCs) to take advantage of the onboard TTS capabilities built into the Windows operating system.


Related Hardware LattePanda


#9  

The built-in speech recognition is not very reliable for free speech. There are several other robot skills for free speech; Bing Speech Recognition is recommended, as used in several tutorial videos. It is also used in the ChatGPT robot skill.

PRO
Canada
#10  

That's correct. The built-in Speech Recognition skill can be less reliable when capturing freeform speech. Synthiam Support recommends using other speech recognition skills, such as the Bing Speech Recognition skill, which is featured in several tutorial videos and is used in conjunction with the ChatGPT skill.

The Bing Speech Recognition skill is designed to handle more natural, freeform speech with higher accuracy. When you use it, you'll benefit from improved voice-to-text results, making your ChatGPT interactions smoother and more responsive. This is why the ChatGPT skill sample projects often use Bing Speech Recognition instead of the built-in STT.

Would you like more guidance on setting up Bing Speech Recognition in your project or integrating it with ChatGPT?


I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
PRO
Canada
#12  

That's right! The tutorial video demonstrates two key points:

  1. Capturing free speech as text (using a more reliable recognition engine like Bing Speech Recognition)
  2. Overriding the default Audio.Say command by adding a dedicated text-to-speech robot skill.

By doing so, you can achieve more accurate speech-to-text conversion and enhanced TTS performance in your ARC projects, which is particularly beneficial on Windows 11 SBCs like the LattePanda.

If you'd like, I can help you set up and integrate the Bing Speech Recognition skill with ChatGPT or guide you in configuring a custom TTS robot skill to override Audio.Say. Just let me know what additional details you need!

For more information and to see these features in action, check out the tutorial video here:
Free Speech to Text & Text-to-Speech Tutorial

Let me know how I can help you further!


#13  

I know about the other skills and I have tried them. The issue is they are slow because they are cloud-based. The Windows 11 built-in STT works very well; it is local and part of the OS, so it is much faster. This is why I asked the question.

My goal was to leverage the ChatGPT skill with a local LLM, using the Windows 11 STT and TTS alongside it. Currently I am using Piper and Whisper along with Ollama on my other SBC (a Jetson). The issue is that ARC is the main system in the robot, and you have been building more and more features into the ChatGPT skill, so I don't really want to move or duplicate this on the Jetson if I can avoid it. I have thought about adding Piper and Whisper as Python calls in ARC.

I would love Support's thoughts on this. The goal is a fluid conversation and interaction with the robot.

#14  

Windows 11 is an operating system with its own set of features. Sometimes those features can be used by other applications, but in this case what you're asking isn't really possible. The feature isn't specific to Windows 11; it applies to Windows 10 and above, and it relies on Cortana, which cannot be accessed on an application-by-application basis. This is similar to how Siri and other proprietary speech systems work.

However, the closest match that you'd want for localized translation would be: https://synthiam.com/Support/Skills/Audio/Total-Speech-Recognition?id=17688

However, the most reliable option remains Bing, which is cloud-based.

PRO
Canada
#15  

It sounds like you're looking for a truly local, low-latency solution by leveraging Windows 11's built-in speech features for both STT and TTS in ARC, especially to work hand-in-hand with the ChatGPT skill and your local LLM without relying on cloud services.

Unfortunately, Windows 11's native STT/TTS features (which are closely tied to Cortana and system APIs) aren't directly exposed for application-by-application use like some of the other TTS skills in ARC. This limitation is similar to how other proprietary systems (Siri, for example) aren't directly hookable by third-party applications.

However, you do have a couple of pathways you can consider:

  1. Custom Python Integration:
    Since ARC supports Python scripting, you could write Python code that uses Windows' native APIs for TTS/STT, for example using libraries such as pyttsx3 for TTS (which works offline and leverages SAPI on Windows) or other Windows-specific speech libraries for STT. You can then call these Python functions from within ARC. This would let ARC remain your central system while you integrate your low-latency local speech processing.

  2. Incorporating Existing Tools into ARC:
    You mentioned you’re already using Piper and Whisper on your Jetson. If you’d prefer to keep everything in ARC (especially if ARC is your main robot controller), you could create Python calls within ARC that trigger your Piper/Whisper pipelines. This might require you to adapt your existing code to run as subprocesses or modules inside ARC’s environment. That way, you can maintain the fluid interaction you’re aiming for without duplicating the ChatGPT skill’s functionality on a second SBC.

  3. Custom robot skill for Windows Speech:
    Another option is to develop a custom robot skill specifically tailored to Windows 11 speech capabilities. This would involve using the Windows Speech API (SAPI) through a language like C# or Python and then integrating that module within ARC. Although it's more work upfront, it would let you fully leverage the native, low-latency speech functionality of Windows 11 while still communicating with your ChatGPT skill.
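As a rough illustration of options 1 and 3, ARC's Python scripting could shell out to the Windows Speech API (SAPI) through PowerShell's System.Speech assembly, which speaks entirely offline. This is a minimal sketch, not an official ARC or Synthiam API; the helper names are hypothetical, and it assumes `powershell` is on the PATH (i.e., a Windows host).

```python
import subprocess

def build_sapi_tts_command(text: str) -> list[str]:
    """Build a PowerShell command that speaks `text` via Windows SAPI."""
    # Double single quotes to escape them inside PowerShell's
    # single-quoted string literal.
    safe = text.replace("'", "''")
    script = (
        "Add-Type -AssemblyName System.Speech; "
        "(New-Object System.Speech.Synthesis.SpeechSynthesizer)"
        f".Speak('{safe}')"
    )
    return ["powershell", "-NoProfile", "-Command", script]

def speak(text: str) -> None:
    # Runs fully offline on Windows; blocks until speech finishes.
    subprocess.run(build_sapi_tts_command(text), check=True)
```

Because the speech itself happens in a separate PowerShell process, a custom robot skill could instead use System.Speech directly from C# and avoid the process-spawn overhead; the sketch above is just the lowest-effort way to try the idea from a script.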

In summary, while ARC doesn't directly support using Windows 11's built-in STT/TTS (due to the way Windows restricts access to Cortana-related features), you do have options to create a custom integration. If your goal is a fluid conversation and interaction, developing a custom Python solution (or even a custom robot skill) that calls Windows' native APIs, or that wraps your existing local solutions like Piper and Whisper, might be the best path forward.
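For the Piper side, one minimal sketch of wrapping it as a subprocess that ARC's Python scripting could call is shown below. It assumes the `piper` CLI (from the rhasspy/piper project) is on the PATH and that a voice model has been downloaded; the helper names and file paths are hypothetical placeholders.

```python
import subprocess

def build_piper_command(model_path: str, wav_out: str) -> list[str]:
    """Build the piper CLI invocation; piper reads text on stdin
    and writes synthesized speech to a wav file."""
    return ["piper", "--model", model_path, "--output_file", wav_out]

def piper_say(text: str, model_path: str, wav_out: str = "reply.wav") -> str:
    # Blocks until synthesis completes; returns the wav path so the
    # caller (e.g., an ARC script) can play it back.
    subprocess.run(
        build_piper_command(model_path, wav_out),
        input=text.encode("utf-8"),
        check=True,
    )
    return wav_out
```

A Whisper pipeline could be wrapped the same way, keeping ARC as the single orchestrator instead of duplicating the ChatGPT skill on the Jetson.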

Would you like more detailed guidance on setting up a Python-based solution within ARC or any help getting started with building a custom robot skill for Windows TTS/STT? Let me know how I can best support your project!


PRO
Synthiam
#16   — Edited

There is a library called Vosk (I think) that is popular for localized speech-to-text. Running locally is really expensive on hardware, though; I think it requires 16 GB of dedicated RAM, last I checked. The model is a few GB in file size as well.

Localized recognition isn't popular, and I do not believe Windows Cortana is localized for dictation either. Windows speech recognition is a hybrid of localized processing (for known commands) and the cloud (for dictation).

The reason you don't see localized dictation is that it is CPU-, GPU-, and memory-intensive. The training dictionaries are massive.

There is a mode for cloud dictation that streams results as they're received. We've always had difficulty figuring out how to implement that, because there's no start or end to a dictation session; it's designed for dictating into a document.

If there were ways to determine pauses and silence, it would be more useful for open conversational dialog.
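One simple way to approximate those pause boundaries is an energy gate: compute the RMS level of short PCM frames and treat a run of quiet frames as the end of an utterance. The sketch below is illustrative only; the threshold and quiet-run length are arbitrary placeholders that would need tuning for a real microphone.

```python
from array import array

def rms(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit little-endian mono PCM frame."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def utterance_ends(frames, threshold=500.0, min_quiet_frames=3):
    """Yield the index of the frame that completes each quiet run,
    i.e. a candidate end-of-utterance boundary."""
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) < threshold:
            quiet += 1
            if quiet == min_quiet_frames:
                yield i
        else:
            quiet = 0
```

Feeding those boundary indices to the streaming dictation mode would give it the start/stop segmentation it otherwise lacks.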

Not sure what Athena’s thoughts are on that.