Cognitive Vision

How to add the Cognitive Vision robot skill

  1. Load the most recent release of ARC (Get ARC).
  2. Press the Project tab from the top menu bar in ARC.
  3. Press Add Robot Skill from the button ribbon bar in ARC.
  4. Choose the Camera category tab.
  5. Press the Cognitive Vision icon to add the robot skill to your project.

Don't have a robot yet?

Follow the Getting Started Guide to build a robot and use the Cognitive Vision robot skill.

How to use the Cognitive Vision robot skill

Use the Microsoft Cognitive Computer Vision cloud service to describe images or read the text in them. The images come from the Camera Device added to the project. This plugin requires an internet connection. If you are using a WiFi-enabled robot controller (such as EZ-Robot EZ-B v4 or IoTiny), consult its manual to configure WiFi client mode, or add a second USB WiFi adapter.

The Synthiam Cognitive Vision robot skill uses machine learning algorithms to enable robots to recognize and describe the objects and text in camera images. This skill is part of the Synthiam ARC (Autonomous Robot Control) platform, which provides a suite of tools for building and controlling robots.

Here's a breakdown of how the Cognitive Vision robot skill works:

  • Integration with ARC: The Cognitive Vision skill is integrated into the Synthiam ARC software, which means it can be easily added to any robot project within the ARC environment.
  • Camera Setup: To use the Cognitive Vision skill, the robot must be equipped with a compatible camera that is connected to the ARC platform. This camera serves as the robot's "eyes," capturing visual data for processing.
  • Skill Configuration: Once the camera is set up, the Cognitive Vision skill can be configured within the ARC software. Users can set the scripts that run after a detection or a text read, and the variables that hold the results.
  • Machine Learning Models: The skill relies on pre-trained machine learning models, hosted in the Microsoft cloud service, that have been developed to recognize and describe image content. These models can be updated and improved over time, enhancing the robot's recognition capabilities.
  • Processing Visual Data: When the robot's camera captures visual data, the Cognitive Vision skill sends it to the cloud service for analysis. The service analyzes the visual input to describe the scene, locate objects, or read text.
  • Outputs and Actions: The results of the cognitive analysis are then outputted within the ARC software. These results can trigger specific actions or behaviors in the robot. For example, if the robot recognizes a specific object, it can be programmed to approach it or manipulate it.
  • Customization: Users can customize the Cognitive Vision skill by choosing the variables that hold the results and by programming unique responses to different visual stimuli, for example by checking the confidence value in a script before acting on a description.
  • On-demand Interaction: The Cognitive Vision skill analyzes the current camera frame whenever a Detect or ReadText instruction is sent, allowing the robot to interact dynamically with its environment. Results are available as soon as the cloud service responds.
  • User Interface: The ARC platform provides a user-friendly interface for the Cognitive Vision skill, making it accessible for users with varying levels of technical expertise. This interface allows for easy configuration and monitoring of the robot's vision capabilities.
  • Integration with Other Skills: The Cognitive Vision skill can be used in conjunction with other skills within the ARC platform, enabling complex and intelligent robot behaviors. For example, it can be combined with navigation skills to allow the robot to move towards recognized objects or individuals.


The robot skill detects objects using cognitive machine learning. It analyzes the image and stores the width, height, location, and description of each detected object in variable arrays. It also analyzes the image for adult content. Use the Variable Watcher to view the detected details in real time.
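As a rough illustration of working with parallel arrays like these, the sketch below picks the detected object with the largest bounding box. It is plain Python rather than an ARC script, and the list names and sample values are hypothetical stand-ins; the real variable names are whatever the skill exposes in the Variable Watcher.

```python
# Hypothetical stand-ins for the variable arrays the skill populates.
# Index i in each list refers to the same detected object.
descriptions = ["person", "cup", "laptop"]
widths = [120, 40, 200]
heights = [300, 50, 150]


def largest_object(descriptions, widths, heights):
    """Return (description, area) for the biggest bounding box, or None if empty."""
    if not descriptions:
        return None
    # Bounding-box area for each detection, in square pixels.
    areas = [w * h for w, h in zip(widths, heights)]
    # Index of the detection with the largest area.
    index = max(range(len(areas)), key=areas.__getitem__)
    return descriptions[index], areas[index]


result = largest_object(descriptions, widths, heights)
print(result)
```

A robot could use the result to decide which object to face or approach first.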

Configuration Menu

The configuration menu for this robot skill allows setting values for the scripts and variables that hold the detected data.

  • Describe:
    This script will be executed after a Detection is completed. You can populate this script to speak the detected object or interact with the detected details.
  • Read Text:
    This script will be executed after a Read Text is completed. You can populate this script to speak the detected text in the image or perform some interaction with the detected details.
  • Detected Scene:
    This variable holds the description of the image. It is populated after a Detect instruction is sent.
  • Confidence:
    This variable holds the confidence value for the detected information. It indicates how confident the service is in the description it produced.
  • Read Text:
    This variable holds the text that was detected in an image. This is populated after a Read Text instruction is sent.


The robot skill can execute scripts after the vision recognition is completed. This allows you to have the robot speak or perform an action based on the detected objects. The Describe script is executed after a Detect is completed, and the Read Text script is executed after a ReadText is completed.
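The kind of decision a Describe script might make can be sketched as follows: speak the description only when the confidence is high enough. This is plain Python, not an ARC script. The function name, the fallback sentence, and the threshold of 60 are assumptions made for illustration, and the confidence is assumed to be a percentage, as the sample say() line on this page suggests.

```python
def describe_response(description, confidence, threshold=60.0):
    """Build the sentence to speak, or a fallback when confidence is low.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    if not description or confidence < threshold:
        return "I am not sure what I see"
    return f"I am {confidence:.0f} percent certain that I see {description}"


# In a real script the two values would come from the configured variables.
print(describe_response("a cup on a table", 87))
print(describe_response("a blurry shape", 30))
```

Keeping the decision in one small function makes it easy to adjust the cutoff after watching real confidence values in the Variable Watcher.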

Control Commands

Several control commands exist for interacting with this robot skill from other skills. In particular, when you want to scan the image for the latest detection information, you must send the Detect ControlCommand.

ControlCommand("Cognitive Vision", "Detach");
Detaches the robot skill from the current camera.

ControlCommand("Cognitive Vision", "Detect");
Simulates pressing the "detect" button. This will detect the objects in the current camera frame and populate the image description variable. The variable can be configured in the configuration menu of this robot skill.

ControlCommand("Cognitive Vision", "ReadText");
Reads any text in the image and stores it in the variable configured in this robot skill's configuration menu.
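This page does not say whether the Detect command returns only after the description variable has been updated. The Describe script is the documented place to react to a finished detection. If a calling script needs the result instead, one option is to poll the variable until it changes. The sketch below is plain Python; `read_value` is a hypothetical callable standing in for however the script reads the configured variable, and none of these names come from ARC.

```python
import time


def wait_for_new_value(read_value, previous, timeout=10.0, poll_interval=0.25):
    """Poll read_value() until it differs from `previous` or the timeout expires.

    Returns the new value, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_value()
        if current != previous:
            return current
        time.sleep(poll_interval)
    return None


# Simulated variable reads: the value changes on the third read.
simulated = iter(["a desk", "a desk", "a cat on a sofa"])
print(wait_for_new_value(lambda: next(simulated), "a desk", timeout=2.0, poll_interval=0.01))
```

Note the limitation: if the new description is identical to the previous one (the scene has not changed), this approach cannot tell a finished detection from a pending one and will time out.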

Educational Tutorial

The Robot Program created this educational tutorial for using the Cognitive Vision behavior control by Synthiam. This same procedure can be executed on any robot with a camera or PC with a USB Camera.

What Can You Do?

An easy example of using this control is adding this simple line of code to the control config. The code will speak out of the PC speaker what the camera sees. Here's a sample project: testvision.EZB


DJ Sures from Synthiam created this demo using a Synthiam JD by combining this Cognitive Vision behavior control, Pandora Bot, and speech recognition. He could have conversations with the robot, which is quite entertaining!

You will need a Camera Device and this plugin added to the project. It would look like this...

And add this simple line of code to the plugin configuration...

say("I am " + $VisionConfidence + " percent certain that i see " + $VisionDescription)

Limited Daily Quota

This robot skill uses a shared license key with Microsoft, enabling ARC users to experiment with and demo this robot skill. The shared license key provides a quota of 500 requests per day for ARC Pro users. Because this robot skill uses a third-party service that costs Synthiam per transaction, the daily quota is designed to keep usage within our spending limit. If your application requires a higher daily quota, we will provide a robot skill that allows you to specify your own license key and pay Microsoft for the service directly. Contact Us for more information.
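To stay within a budget of 500 requests per day, a project can count its own requests before sending a Detect or ReadText. Spread evenly over 24 hours, 500 requests works out to one request every 86400 / 500 = 172.8 seconds. The sketch below is a minimal local counter in plain Python. The class name is hypothetical, and resetting the count at local midnight is an assumption, since this page does not state when the service resets its quota.

```python
from datetime import date


class DailyQuota:
    """Local request counter that resets when the calendar day changes."""

    def __init__(self, limit=500):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_consume(self, today=None):
        """Return True and count the request if quota remains, else False."""
        today = today or date.today()
        if today != self.day:
            # New day (assumed reset point): start counting from zero.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


quota = DailyQuota(limit=500)
if quota.try_consume():
    print("OK to send a Detect request")
```

This only tracks requests made by one running script; it does not know about requests sent from other projects or sessions.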



#1   — Edited

I just watched the videos on these services and tried to set up a text read, and he just says he is 87 percent certain he sees the words, but not the actual words. Anyone with pointers to get this to work? I tried both handwritten and typed words. Do they need to be a certain size? Or does this service from Microsoft need work?

say("I am " + $VisionConfidence + " percent certain that i see the words" + $VisionReadText)

#2   — Edited
  1. The code doesn't have a space when appending $VisionReadText to the string, which means it would come out as one really weird word (e.g. wordssometextthatwasdetected). Your code should be...

say("I am " + $VisionConfidence + " percent certain that i see the words " + $VisionReadText)

*Notice the space after "words"

  2. I tested it and it reads text fine. Re-check your code and see if it works once the space is added.


How did I miss that..cross eyed! Thank you ..once again! Must sleep..

#4   — Edited

...odd the new version filters adult content?! How is that a feature? What was Microsoft trying to prevent, or what are the use cases? Especially if you think of all the other filters they could have added for what's found in an image.

#5   — Edited

I don’t believe there’s any filtering being done. There is a value of how much adult content there is, but I don’t believe anything is filtered

you can always stand nude in front of your robot to test it out hahaha

#6   — Edited

Yeah, you're right, not a filter, more of a tag. Still wondering why that’s a feature. Why not something useful like cats chasing dogs?

I did a little research earlier today and it’s possible to create custom object detection projects. Train and prediction. Makes it more useful for a case by case robot.

..of course I got naked in front of the vision cognition and it said '100% sure you should put your clothes back on!' Lol.


Ya it’s a rating - it’ll help some applications to prevent abuse. I know we’re using it for a new service we’re releasing in beta next week.

I'll take a look at the custom detection part, although it is quite easy to do locally with the object tracking built into the camera control.


Looking forward to that beta!


Is cognitive emotion redundant, because Cognitive Face reports back the same emotions as well as the other options: age, name, etc? Not sure if I am missing something.

#10   — Edited

This skill, Cognitive Vision, does not return any face or emotion information.

I believe you are asking about Cognitive Face and Cognitive Emotion? Those two report similar stuff, except Emotion doesn't report face. There are slight differences in the returned data of those two. This skill that you replied to is Cognitive Vision and not related to either of those:)


I see the parameter you need to pass in the ControlCommand to read text is "ReadText". But what parameter do you send to describe an image? "DescribeImage"?

Thomas Messerschmidt

#12   — Edited

(The robot skill will analyze the image, and each detected object will be stored in variable arrays: the width, height, location, and description of each object.) I do not see any arrays as your above photo shows; do we need to enable this somewhere? For some reason my results have been mixed. The computer camera is taking a very clear picture (106 KB) and returning after about 3 seconds with 81% confidence. I used Say PC $VisionDescription in Blockly, but it does not want to say it anymore. Is there a preliminary block that I need to put above it to verify it has loaded the new variable value? The read text from image doesn't work at all, as fxrtst was mentioning, which I was really hoping it would. I tried multiple different word scenarios and used Say PC $ReadTextFromImage in Blockly. It did say something the very first time but nothing since. I have a lot of plans for this skill; I just need to get it to work like your video.


Thanks for pointing me in the right direction. I got it working reading words but couldn't get it to read numbers at all. Is there any way that I can help train it? Taking this to the next step, is there a way to read only a certain area of the screen grid rather than everything on the screen? I have my reasons. What would the script look like? I realize that we can do that on our end with x,y variables, but Microsoft will probably want to speak the whole screen. I will try to find the manual from Microsoft on this, but if you have a more in-depth manual on its capabilities, that would help. They probably have lots of other things they have encountered and overcome as well. Thanks