
Azure Text To Speech

by Microsoft

Enable fluid, natural-sounding text to speech that matches the intonation and emotion of human voices.

Requires ARC v7 (Updated 11/23/2023)

How to add the Azure Text To Speech robot skill

  1. Load the most recent release of ARC (Get ARC).
  2. Press the Project tab from the top menu bar in ARC.
  3. Press Add Robot Skill from the button ribbon bar in ARC.
  4. Choose the Audio category tab.
  5. Press the Azure Text To Speech icon to add the robot skill to your project.

Don't have a robot yet?

Follow the Getting Started Guide to build a robot and use the Azure Text To Speech robot skill.

How to use the Azure Text To Speech robot skill

The Synthiam ARC robot skill for Azure Text to Speech is a powerful integration that enables your robot to generate human-like speech using Microsoft's Azure Text to Speech service. This skill allows you to take your robotics project to the next level by providing your robot with a natural and dynamic voice. Whether you are building a companion robot, educational tool, or any other robotic application, this skill enhances user interaction and engagement through spoken language.

Applications

  • Human-Robot Interaction: Enable your robot to engage in natural conversations with users, making it a more relatable and interactive companion.
  • Educational Tools: Enhance the educational value of your robot by enabling it to provide spoken explanations and instructions to learners.
  • Assistive Technology: Create robots that assist individuals with disabilities by providing spoken assistance and information.
  • Entertainment and Storytelling: Develop storytelling robots that bring characters and narratives to life through speech synthesis.

Get started with the Synthiam ARC robot skill for Azure Text to Speech and bring your robotic project to life with expressive, human-like speech capabilities. Elevate the user experience, foster engagement, and unlock a world of possibilities with this innovative integration.

Main Window

User-inserted image

The main window in the ARC project workspace displays debug and activity information.

Configuration Window

User-inserted image

Neural Voice - Enter the neural voice that you wish to use. This value can also be changed dynamically using the ControlCommand syntax.

Sample - Press the SAMPLE button to hear a sample of the selected voice.

View List - View a list of the available voices.

Speak out of EZB - If checked, the spoken audio is sent to the EZB speaker (if supported). Otherwise, the audio is played from the PC's default output device.

Start Speaking Script - The script that executes when the text begins to speak.

Speak Text Variable - The variable that stores the current text that is being spoken.
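
For example, a Start Speaking Script could read the text variable and log it so other skills can react to what is being said. This is a minimal sketch; the variable name below is hypothetical and should match whatever is entered in the Speak Text Variable field:

// Example Start Speaking Script: log the text about to be spoken.
// "$AzureSpeechTxt" is a hypothetical name - use the variable configured
// in the Speak Text Variable field of this robot skill.
var spokenText = getVar("$AzureSpeechTxt");
print("Now speaking: " + spokenText);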

Available Voices

Microsoft provides a list of available voices for the Azure Text-to-Speech system here: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts

When adding a voice from the above link, copy the greyed text in the "Text-to-speech voices" column on the right of the chart and paste it into the Neural Voice field of the robot skill configuration screen. The circled text in the image below demonstrates which value to copy. There is also a ControlCommand() to change the voice programmatically.

User-inserted image

Control Commands

You can view the available control commands by viewing the "Cheat Sheet" when editing a script in ARC.

User-inserted image

*Note: Using the "Speak" ControlCommand is recommended, as SpeakSsml requires an advanced understanding of the SSML format. SSML is documented at these links, and a minimal SpeakSsml sketch follows the list:

  1. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup
  2. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure
  3. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice
  4. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-pronunciation
  5. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-ssml-phonetic-sets
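
For those who do want SSML, the sketch below shows what a SpeakSsml call might look like. The exact command name and arguments should be verified in the Cheat Sheet; the markup itself follows Microsoft's documented SSML structure:

// A SpeakSsml call (verify the exact command name in the Cheat Sheet).
// The SSML selects a voice and slows the speaking rate slightly.
var ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
           '<voice name="en-US-JennyNeural">' +
           '<prosody rate="-10%">Hello, this sentence is spoken a little slower.</prosody>' +
           '</voice></speak>';
ControlCommand("Azure Text To Speech", "SpeakSsml", ssml);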

Example

This example will walk you through creating a simple project that speaks the entered text out of the PC speaker. Follow the instructions, and your computer will speak in any voice you configure!

User-inserted image

  1. Add the Azure Text To Speech robot skill to your project (Add robot skill -> Audio -> Azure Text To Speech).

  2. Add a SCRIPT robot skill to your project (Add robot skill -> Scripting -> Script).

  3. Edit the script robot skill and insert the following JavaScript code.


ControlCommand("Azure Text To Speech", "speak", "Hello i am speaking to you");

  4. Save and close the script editor.

  5. Press the START button on the script, and the robot will speak out of the PC speaker.

  6. Press the CONFIG button on the Azure Text To Speech robot skill to view the configuration. Press the VIEW LIST link to view the available voices in this configuration window. Paste the voice that you wish to use into the Neural Voice textbox. Press SAMPLE if you want to hear a sample of the selected voice.

    User-inserted image

  7. To change the voice programmatically (in code), you can send the control command. This example will change the default voice to US Jenny Neural.


ControlCommand("Azure Text To Speech", "setVoice", "en-US-JennyNeural");

How Does Speech Synthesis Work?

Text-to-speech (TTS) technology converts text into speech sounds through a complex process that involves several key components and techniques. Here's an overview of how TTS works:

1. Text Analysis:

  • The process begins with the analysis of the input text. This involves breaking down the text into smaller units, such as words, sentences, and paragraphs.
  • The TTS system may also analyze punctuation, capitalization, and other text features to add appropriate prosody and intonation to the synthesized speech.

2. Linguistic Processing:

  • Once the text is segmented, linguistic processing takes place to identify the text's phonetic, prosodic, and grammatical elements.
  • The system determines the language, dialect, and pronunciation rules to be applied. It also identifies the stress and intonation patterns for each word and sentence.

3. Phoneme Conversion:

  • Phonemes are the smallest units of sound in a language. The TTS system converts the linguistic information into a sequence of phonemes that represent the spoken sounds of the words in the text.
  • Different languages have different phonemes, so the TTS system needs to know the specific language used.

4. Prosody and Intonation:

  • Prosody refers to the rhythm, pitch, and stress patterns of speech. Intonation includes the rise and fall of pitch in sentences.
  • TTS systems use linguistic and contextual information to determine the appropriate prosody and intonation for the synthesized speech, making it sound more natural.

5. Acoustic Modeling:

  • Acoustic modeling involves mapping the phonemes to their corresponding audio representations. This includes the selection of waveforms or audio samples for each phoneme.
  • TTS systems use databases of pre-recorded phonemes or generate speech sounds synthetically using algorithms like concatenative or parametric synthesis.

6. Synthesis:

  • The synthesized speech is generated by combining the acoustic representations of phonemes to create a continuous audio stream.
  • TTS systems may apply techniques like concatenative synthesis (using pre-recorded phonemes), formant synthesis (generating speech based on the vocal tract's formants), or other methods to create the final speech output.

7. Articulation:

  • The TTS system simulates the articulation of speech sounds, including the movement of the vocal tract and other speech-related organs, to create natural-sounding speech.

8. Output:

  • The final audio waveform is generated and played through speakers or other audio output devices, making the synthesized speech audible.
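
The stages above can be pictured as a single pipeline where each stage transforms the output of the previous one. The sketch below is purely conceptual; none of these functions exist in ARC or Azure, and it only illustrates the data flow:

// Conceptual TTS pipeline. Each stage transforms the previous stage's output.
// These functions are illustrative placeholders, not a real API.
function synthesize(text) {
    var units = analyzeText(text);              // 1. split into words and sentences
    var linguistics = processLanguage(units);   // 2. language and pronunciation rules
    var phonemes = toPhonemes(linguistics);     // 3. smallest sound units
    var prosody = applyProsody(phonemes);       // 4. rhythm, pitch, and stress
    var acoustics = mapToAudio(prosody);        // 5-7. acoustic modeling and articulation
    return renderWaveform(acoustics);           // 8. final audio output
}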

It's important to note that the quality and naturalness of TTS output can vary based on the complexity of the TTS engine, the available linguistic knowledge and phoneme databases, and the quality of the acoustic modeling. Modern TTS systems, especially those based on deep learning techniques, have significantly improved in producing highly natural and expressive speech.

How Azure Text to Speech Works

Azure Text to Speech is a cutting-edge cloud-based service offered by Microsoft, designed to convert text into lifelike, natural-sounding speech. This powerful technology harnesses the capabilities of deep learning and neural networks to generate high-quality audio output from input text. With its extensive language and voice support, Azure Text to Speech provides a versatile solution for various applications, including human-robot interactions, accessibility, education, customer service, and more.

The technology behind Azure Text to Speech is rooted in sophisticated machine learning models and neural networks. These models have been trained on vast amounts of multilingual and multitask supervised data, resulting in the ability to generate speech that is nearly indistinguishable from human speech. Azure Text to Speech employs advanced natural language processing techniques to ensure accurate pronunciation and intonation, making the synthesized speech sound remarkably realistic.

Language and Voice Support

Azure Text to Speech offers one of the most comprehensive language and voice support libraries. Users can choose from many languages and dialects, allowing seamless communication with diverse audiences. Furthermore, voice customization options enable users to fine-tune characteristics such as pitch, speed, and even the emotional expressiveness of the generated speech.

Text Formatting and SSML

To control the pronunciation, emphasis, and intonation of the generated speech, users can employ text formatting and Speech Synthesis Markup Language (SSML). This enables a high degree of customization, ensuring the speech output aligns with the intended message and context. More information about the SSML format is available at these links (an expressiveness sketch follows the list):

  1. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup
  2. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-structure
  3. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-voice
  4. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-synthesis-markup-pronunciation
  5. learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-ssml-phonetic-sets
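
As one illustration of that expressiveness, Microsoft's SSML includes an mstts:express-as element for speaking styles on supported neural voices. The sketch below assumes the SpeakSsml command described earlier; only some voices support styles, so check the voice list:

// A SpeakSsml call using Microsoft's express-as speaking style.
// Style support varies by voice; en-US-JennyNeural is used as an example.
var ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ' +
           'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">' +
           '<voice name="en-US-JennyNeural">' +
           '<mstts:express-as style="cheerful">What a great day for robots!</mstts:express-as>' +
           '</voice></speak>';
ControlCommand("Azure Text To Speech", "SpeakSsml", ssml);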

Security and Compliance

Synthiam and Microsoft are committed to providing a secure and compliant environment for your data. Azure Text to Speech adheres to rigorous security and compliance standards to ensure your information is handled with the utmost care and responsibility.

Use Cases and Industries

Azure Text to Speech has found relevance in various industries and use cases, including accessibility, customer service automation, education, and entertainment. Real-world case studies and testimonials from organizations and individuals showcase its versatility and can guide users in their own ventures. Enhance your experience by following best practices: optimize the text input, select the most suitable voice for your application, and make full use of SSML for effective customization.

Limitations

The generated speech is limited to 500 characters per call and 1,000 calls per day with an ARC Pro subscription.
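
If your text can exceed the 500-character limit, one approach is to split it at sentence boundaries and speak each piece in turn. This is a minimal sketch that assumes sequential speak calls are queued by the skill and that no single sentence exceeds 500 characters:

// Split long text at sentence boundaries into chunks under 500 characters,
// then speak each chunk through the Azure Text To Speech skill.
function speakLongText(text) {
    var sentences = text.split(". ");
    var chunk = "";
    for (var i = 0; i < sentences.length; i++) {
        // Restore the separator removed by split(), except on the last piece.
        var sentence = sentences[i] + (i < sentences.length - 1 ? ". " : "");
        if (chunk.length + sentence.length > 500 && chunk.length > 0) {
            ControlCommand("Azure Text To Speech", "speak", chunk);
            chunk = "";
        }
        chunk += sentence;
    }
    if (chunk.length > 0)
        ControlCommand("Azure Text To Speech", "speak", chunk);
}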


#1   — Edited

Works great, thanks.

Tried a few voices.

#2  

Quote:

Press the Azure Text To Speech icon to add the robot skill to your project

This plugin is not showing up for me... running 06.09... am I missing something?

User-inserted image

#3  

Ensure you have an internet connection when ARC is loaded. The list of available plugins requires internet connectivity to reach our servers. Additionally, if two copies of ARC are running at the same time, the plugins will not update because the files will be in use by the other copy of ARC.

Additionally, you can scroll to the top of any robot skill page and press the large Get Version X button.

#4  

For some reason, Windows no longer had an association of .ezplugin files with the plugin manager. Re-associated the files, and now it installs correctly.

#5  

All works here

#6   — Edited

Updated to v3 to capture the error when testing a voice name that does not exist in the config menu.

#7  

V7 updated...

  • Added a script that will execute each time text is spoken.
  • Added a variable that stores the current text that is being spoken.

Read the manual above for more information.

#8  

We have updated the manual above to explain what part of the voice list to copy and paste into the configuration of this robot skill.

#9  

What if I have my own Azure Speech Service? Can I use this to connect to it rather than be limited to ARC Pro?

#11  

@Athena How can I make Azure speech be the default speech whenever I use JavaScript Audio.say?

#13  

To make Azure Text to Speech the default speech synthesis method when using the Audio.say() command in JavaScript within Synthiam ARC, you would need to utilize the ControlCommand() function to interact with the Azure Text to Speech robot skill. However, the Audio.say() command in JavaScript directly utilizes the computer's default speech synthesis engine and does not natively support redirecting through Azure Text to Speech without explicitly invoking the Azure Text to Speech skill.

Here's a workaround approach:

  1. Ensure Azure Text to Speech Skill is Added and Configured: First, make sure you have added the Azure Text to Speech robot skill to your ARC project and configured it with your Azure credentials and preferred settings.

  2. Use ControlCommand with Azure Text to Speech: Instead of using Audio.say(), use the ControlCommand() function to send the text you want to be spoken to the Azure Text to Speech robot skill. You'll need to use the specific ControlCommand for speaking text, which is typically something like ControlCommand("Azure Text to Speech", "Speak", "Your text here").

Here's an example of how you might do this in JavaScript within ARC:

// Replace "Azure Text to Speech" with the exact name of your Azure TTS skill in the project
// Replace "Your text here" with the text you want to be spoken
ControlCommand("Azure Text to Speech", "Speak", "Hello, how are you today?");

Note: The exact command to use ("Speak") might vary based on the implementation of the Azure Text to Speech robot skill. You should refer to the documentation or the "Cheat Sheet" in ARC for the exact commands available for the Azure Text to Speech skill.

  3. Custom JavaScript Function Wrapper (Optional): For convenience, you can create a custom JavaScript function that wraps the ControlCommand() invocation. This way, you can easily reuse it throughout your script without having to type the full ControlCommand() each time.

// A reusable wrapper that routes text through the Azure TTS skill.
function sayWithAzureTTS(text) {
    ControlCommand("Azure Text to Speech", "Speak", text);
}

// Usage
sayWithAzureTTS("This is a test message using Azure Text to Speech.");

By following this approach, you can effectively make Azure Text to Speech the default method for speech synthesis in your ARC project when using JavaScript. Remember, this method requires explicitly calling your custom function or the ControlCommand() for each instance where you want text to be spoken using Azure Text to Speech.


I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
#14  

I was just checking, as I have already made a workaround (and it works very nicely), because I did not see a way to do it in the manual.

#15   — Edited

That's what functions (or words) are for. You see, a tree is a tree, a lake is a lake, and a banana is a banana. You can't make a banana into a tree, and you can't make a lake into a banana. :) hehe

So you can't make a function into another function. Audio.Say is a function. You can't make it into a different function. That's why we have different functions that use different words.

So you have the word "banana" to help differentiate between a banana and a tree and a lake. hehe

#16  

Ok, maybe I asked the wrong question. Is there a way to have all spoken words in ARC be spoken by Azure? I can do it now, but it is a 2-step process: put the text in a variable, then speak it. It would be nice to just have any spoken words be Azure, just as you would do a PC say _______ and then it takes over.