The IBM Watson Services plugin created by @PTP currently allows you to perform Speech to Text, Text to Speech, and Visual Recognition using Watson Services.
You can download and install the plugin here: https://www.ez-robot.com/EZ-Builder/Plugins/view/251
Before you can use the plugin, you will need to sign up for a free 30-day trial of the IBM Speech to Text service here https://www.ibm.com/watson/services/speech-to-text and a free 30-day trial of IBM Text to Speech here https://www.ibm.com/watson/services/text-to-speech/
I will create some examples as this moves forward and try to answer any how-to questions.
Thanks for creating this, PTP; as with all your plugins, this is an excellent piece of work and a showcase of your talents.
I'm almost done with the Vision Recognition and Conversation Services.
They are simple to use; Watson handles all the setup and plumbing for you.
I'm working on a demo, but it requires additional hardware and low-level code, so I'm busy...
If anyone has ideas for a demo using a Six or JD, please contribute.
Great work @ptp, as always. I look forward to trying it out.
I'm running a series of human-robot interaction experiments with a couple of students at my lab in about a month from now. We could see if we could integrate this module into one of the scenarios with the JD. If it works out, we can add some performance metrics, which will give you an idea of how the system performs in an interactional setting. If nothing else, I can give you a few videos of people interacting with the system. That might be a bit more interesting than a video. Let me know what you think.
@larschrjensen,
Thanks for the feedback, and yes more feedback is welcome and it helps to generate/fuel more ideas.
@ptp I've been playing around with the plugin and it works quite well. I've noticed that you can't set the silence threshold below 2000 ms. This effectively means that there is at least a 2+ second pause between an utterance and a response. This is quite long, considering that people normally expect a response within 300 ms. Is it possible to change this? I realize this could possibly affect the performance and send off utterances prematurely, but it would make the plugin a bit more flexible.
@larschrjensen,
Done, the minimum value is now set to 500 ms.
The audio capture buffer is 100 ms; the handler gets notifications every 100 ms.
Let me know if that is OK; I can change it to 300 ms.
I was in the middle of something, so I hope no harm was done.
Fantastic, thanks.
Just noticed the visual recognition update. Nice. I had a quick play; it looks great. I will need to learn how to train a model in Watson Studio.
Curious: is there a way to get ControlCommand to wait until visual recognition is complete? I found I was reading the previous variables and not the new ones. I tried a Sleep() and that seemed to work, but I was setting it to 1000 ms and sometimes that was not long enough. I played with setting $WatsonClassifyTypes[0] = "wait" and then waiting until it != "wait", but I kept getting stuck in a loop. I am not sure if there is a better way. I was triggering the visual recognition off the speech-to-text phrase "What do you see".
@Nink,
1) You can add your script to the VR script; it will be executed when the visual recognition is complete.
2) Add a small VR script that just sets a flag variable, then monitor that variable in another script. The variable must be initialized in an init script (a small sketch follows below).
3) There is an internal CaptureId variable. I'll expose it in the next update, and you will be able to monitor the $WatsonPictureCaptureId variable.
To avoid being caught between two different calls, the VR script (option 1) is the best option. After the VR is done, the variables are assigned with the results and the VR script is executed. If there is another VR call, the results are queued until the VR script finishes. While your VR script is being executed, the results (i.e. the variables) are constant and relate to the current call.
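A minimal EZ-Script sketch of option 2 (the flag variable and the ControlCommand name are illustrative, not necessarily the plugin's exact API; check the plugin's ControlCommand() cheat sheet):

# Init script: create the flag once, before any recognition runs
$WatsonVRDone = 0

# VR script (executed by the plugin when a recognition completes):
# signal the waiting script that fresh result variables are available
$WatsonVRDone = 1

# Monitoring script: request a classification, wait for the flag, read the results
$WatsonVRDone = 0
# Illustrative command name
ControlCommand("IBM Watson Services", Classify)
WaitForChange($WatsonVRDone)
Say("I see " + $WatsonClassifyTypes[0])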
I'm experiencing an issue with the plugin. After running for about 30 minutes, ARC freezes up. According to Windows the program is still responding, but effectively it is not. RAM usage also increases from around 300 MB to 600 MB. It only happens when listening is active; the STT status doesn't make a difference. No other project controls or plugins are running. I can reproduce this error on a second PC in a different project.
Any thoughts?
@larschrjensen
I fixed a memory leak. Let me know if the new version fixes the problem.
@ptp Indeed it does. Thanks
Thanks @ptp. I feel a little silly now. It worked like a charm. I was calling the VR from speech to text and then processing in the text-to-speech script instead of creating a script that runs after the VR has processed. User problem :-)
@nink
Please send me an email; I wish to discuss something with you. My addy is in my profile.
Thanks
The following posts are related to Visual Recognition.
I took a picture of a playing card (Seven of Clubs).
Then I performed a few quick tests:
IBM Watson Visual Recognition Services:
Microsoft Computer Vision API:
Google:
It's obvious these results don't help.
Step 1: Create and train a custom image classifier:
I used the EZ-Robot Camera V2; it's important to use the same camera for training and recognition.
Using the ARC camera snapshot control, I took 10-15 pictures of each card (7, 8, 9, Jack and King of Clubs).
It's important to test different light conditions, angles, orientations, etc.
I created one zip file for each card's image folder.
Then I ran a script roughly like the sketch below:
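(The command, parameter and path names here are illustrative, not the plugin's exact API; one zip of training pictures is passed per class, and the first argument is the classifier name.)

# Create and train a custom classifier: one zip of sample pictures per class
ControlCommand("IBM Watson Services", CreateClassifier, "PlayingCards", "c:\cards\SevenOfClubs.zip", "c:\cards\EightOfClubs.zip", "c:\cards\NineOfClubs.zip", "c:\cards\JackOfClubs.zip", "c:\cards\KingOfClubs.zip")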
This will upload the image files and create a custom classifier named PlayingCards with 5 classes (SevenOfClubs ... KingOfClubs).
Training starts after the upload and can take a few minutes, depending on the number of classes, pictures, picture size, etc.
Step 2: List the existing custom classifiers and their status, e.g. Ready or Training.
While the classifier is still training, the status is reported as Training.
Running the same code again once training has finished, the status is reported as Ready and the classifier is ready to be used.
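A sketch of what the listing could look like in EZ-Script (the command and the result variable names are illustrative):

# List the custom classifiers and print their training status
ControlCommand("IBM Watson Services", ListClassifiers)
# Give the call a moment to complete, then read the (illustrative) result variables
Sleep(2000)
print("Classifier: " + $WatsonClassifierName + " - status: " + $WatsonClassifierStatus)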
Step 3: Use the custom classifier
The following code uses the default (Watson) classifiers.
The following code uses the above custom classifier.
Note: PlayingCards_978301467 is the classifierId; PlayingCards is the classifier name.
The following code uses both the default and the above custom classifier.
Note: you can pass multiple classifier ids. A sketch of all three calls follows.
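(Sketch only; the control and command names are illustrative. The underlying Watson API refers to the built-in classifier as "default".)

# Default Watson classifiers only (no classifier id passed)
ControlCommand("IBM Watson Services", Classify)

# Custom classifier only (PlayingCards_978301467 is the classifierId)
ControlCommand("IBM Watson Services", Classify, "PlayingCards_978301467")

# Default plus custom classifier: multiple classifier ids can be passed
ControlCommand("IBM Watson Services", Classify, "default", "PlayingCards_978301467")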
Testing the initial picture (Seven of Clubs) with the custom classifier: the SevenOfClubs class has the highest score.
You are amazing.
Once you have a classifierId (i.e. after you create one), you can update the classifier by adding new classes (e.g. more cards) or delete the existing classifier; please check the ControlCommand() cheat sheet.
The free (Lite) service only allows one custom classifier (e.g. "PlayingCards") and does not allow updates to an existing classifier.
So the only solutions are to delete the existing classifier and create it again with more classes, to create another classifier for another purpose, or, if you feel confident, to move to the Standard service.
Bear in mind that good training requires pictures; the more the merrier.
I did a few tests (with 10 pictures per card); an Eight of Clubs can get a higher score at a specific angle/light... the solution is to train with more pictures.
Thanks Nink.
Taking the pictures takes time, and the small details are very important: playing with the card, putting something underneath to create a distorted image, changing the distance/orientation from the camera, playing with the light, etc., and maybe introducing some minor errors (fingers while holding?).
Although the Lite plan does not allow incremental updates, I plan to use my assistants (e.g. the kids) to take more pictures, and then I'll try to create the classifier with all the zip files (big upload).
Thanks for this, ptp. My daughter is visiting so I am entertaining this weekend, but I will get back to this during the week and start taking photos of cards with an EZ-Robot camera.
I uploaded about 15 cards, 15 photos each, in various lighting, with varying levels of success. I am using a cheap deck I picked up at the local dollar store (my logic was that I can buy a lot of the same cards cheaply). They don't look quite like a normal deck, so I may try a better deck of cards, but I am confident I can get it to the point where it will recognize all 52 cards with lots of training.
Now the next question is: how do I get it to recognize multiple cards at once? Bounding boxes?
@Nink,
We need an offline solution to detect objects and then apply VR to each detected object. There are a few ideas using OpenCV and TensorFlow.
TensorFlow with Watson: https://medium.com/ibm-watson/dont-miss-your-target-object-detection-with-tensorflow-and-watson-488e24226ef3
TensorFlow on Linux is well supported; you will find a lot of examples with Linux and Python, some of them on a Raspberry Pi. Windows is a different story.
I'm exploring the Windows implementation, but so far it's not possible to build an ARC skill plugin: most libraries are x64 and ARC is an x86 (32-bit) application.
The solution is to build a 64-bit application to handle the TensorFlow logic and create a communication layer between that application and the ARC skill plugin.
OK, thanks ptp. It sounds like it is in the too-hard basket for now :-) I will just stick with one card at a time :-)
I photographed all the cards if you want them, but some need a lot more photos (or I have too many bad-quality photos), as I am getting a lot of false positives. I wish the free version of Watson VR would allow us to retrain when we have an issue (no, it is not an Ace of Clubs, it is an Ace of Spades).
I can send a physical deck of the ones I used to anyone who wants them. I purchased them at Dollarama (a Canadian dollar store); they are the Victoria brand.
Here is my edit of ptp's script so others don't need to retype it. Note my picture location is different, so you will have to change the path, but a search and replace in your favourite editor is easier than retyping and debugging (get a zip file name wrong and it will not work).
I spent about 10 hours photographing, uploading and training (it takes about an hour to train now, with over 2,000 photos), using an EZ-B and camera, placing cards on a table and changing position/lighting. I had about 80% of the deck at very high accuracy. If it didn't read a card correctly, I would take more photos, upload, train and repeat.
It can take anywhere from 2 to 20 seconds to come back with an answer now when you show a card.
Now confident I was ready, I pulled out JD and held up a card in front of him. Not a clue (about 5% accuracy).
This tells me my training method is bad, or we need a very controlled environment.
I think we could probably do a controlled read if we used a static base with controlled lighting. For example, a game of Blackjack (21) between JD and Six with me as the dealer, with JD and Six sitting in specific positions and the cards placed in predetermined locations on a playing board.
The other alternative is that I create an automated method for reading cards and taking photos. Perhaps something like: hand JD a card and then have a script run so he turns around in a circle and changes the angle of the card while taking pictures, so the lighting, angle and background keep changing. Then, when he is finished (say 100 photos), upload the photos he took and ask for another card.
Thinking ....
Edit - It turns out JD's dexterity is not good enough to read a card he is holding (he can't get a good enough angle/distance), so it will require a custom bot to read cards (more servos and perhaps a 360 servo). Also, it appears that using the same camera model is not good enough; it has to be the exact same camera. If I take photos with JD and try to read them with another EZ-B + camera it doesn't work, and vice versa.
@Nink,
I'm still working on it and researching other options. I've compiled TensorFlow from source code; it took me 3 failed attempts, and the successful compilation took 4 hours (Core i7) and almost 6 GB of disk space to produce a 64-bit C library (dll and lib). I can't find that card brand; maybe you can suggest one available on both amazon.ca and amazon.com. That way we could share pictures.
I took pictures too... I wanted to test two different resolutions, 320x240 and 640x480, but unfortunately the high-resolution pictures came out low resolution; the snapshot control ignored the camera control setting.
I'll add picture/snapshot functionality to the Watson plugin.
Can you share a zip of one of your cards?
Hi ptp. Here is one card. I am apprehensive about sending photos of all the cards due to copyright laws. Maybe we should print off some Creative Commons cards.
NineofHearts.zip
Plan B
JD, a record player, and an iPad playing a Netflix movie.
Have you seen the .NET bindings for TensorFlow, for a plugin? https://www.nuget.org/packages/TensorFlowSharp/1.6.0-pre1
@DJ,
Yes, the wrapper works on top of tensorflow.dll (libtensorflow.dll), available from the TensorFlow project.
https://github.com/tensorflow/tensorflow/issues/10817
sources: https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-windows-x86_64-1.2.0.zip
http://ci.tensorflow.org/view/Nightly/job/nightly-libtensorflow-windows/lastSuccessfulBuild/artifact/lib_package/libtensorflow-cpu-windows-x86_64.zip
The tensorflow.dll is 64-bit.
@Nink,
Can you explain your plan B?
Regarding the playing cards, we don't need to exchange picture files publicly; let's find a common deck available on both Amazons.
OK, my deck was this one on amazon.ca (it's not on amazon.com), although I paid $1.25 and they want $7 on Amazon. I can put a deck in the mail if you want, or if you can provide a link to your cards on amazon.ca I can order a deck. https://www.amazon.ca/gp/offer-listing/B01A61ZQI8/ref=sr_1_1_olp?ie=UTF8&qid=1522329419&sr=8-1&keywords=victoria+playing+cards&condition=new
Plan B was to just take 100 photos of each card (1 photo every second) in the camera app, with the background and angle changing using the iPad and record player, while I move the robot's head up and down, creating shadows and changing the lighting. The logic was that I could get through this in about 2 hours.
I have been thinking about a good use case for the ability to read cards. Obviously a QR code can be read easily, so this makes me wonder whether this is really a good application for visual recognition, but at least we are learning and it will make a good demo.
Hi @ptp. I wonder if we need to break the problem into smaller chunks, so each problem is easier for Watson to solve, and then string these together. We are not providing any negative classes; we only say what each card is, but not what it isn't.
Colour cluster: red cards positive and black cards negative. Is it a red card or a black card? This should be a relatively easy problem to solve. Assume red.
Red-suit cluster: Hearts positive and Diamonds negative. Again a relatively easy problem to solve. Assume Hearts.
Hearts cluster: picture cards positive and numeric cards negative. Assume picture.
Hearts picture cluster: is it the Jack, Queen, King or Ace of Hearts?
If we could do this and then somehow feed the results back into machine learning, I wonder whether it would be accurate and quick to process.
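A rough sketch of stringing such a cascade together in EZ-Script, assuming one custom classifier per cluster and the flag-variable pattern sketched earlier in the thread (the classifier ids, command and variable names are illustrative):

# Step 1: red or black?
$WatsonVRDone = 0
ControlCommand("IBM Watson Services", Classify, "RedOrBlack_000001")
WaitForChange($WatsonVRDone)
$color = $WatsonClassifyTypes[0]

# Step 2: narrow down to the suit within the winning colour
$WatsonVRDone = 0
if ($color = "Red")
  ControlCommand("IBM Watson Services", Classify, "HeartsOrDiamonds_000002")
else
  ControlCommand("IBM Watson Services", Classify, "ClubsOrSpades_000003")
endif
WaitForChange($WatsonVRDone)
$suit = $WatsonClassifyTypes[0]
Say("Looks like a " + $color + " card: " + $suit)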
Hi @ptp, Do you think you might be able to build an ARC skill plugin that can run TensorFlow like this:
https://www.youtube.com/watch?v=_zZe27JYi8Y
and return the objects' names and their x,y coordinates?
Thomas
Sorry for the delay... I'm away for the holidays (school break for Passover/Easter).
Subjects:
Playing cards: I'll get these playing cards: https://www.amazon.com/Bicycle-Playing-Card-Deck-2-Pack/dp/B010F6BAES They are available on Amazon US, UK and CA, so it will be easier to share the training pictures (not publicly).
Plugin roadmap: soon I plan to update the plugin to support training with negative samples and the Assistant (formerly Conversation) API, although there are some gaps to fill.
Offline object detection: this will take a little longer; I'm still researching. Most implementations (e.g. libraries) are 64-bit and require some resources (memory and CPU, and when possible a GPU), so some plumbing is needed between ARC/plugins and an external application.
@Thomas:
Watson: I presume you managed to find and set up the Watson Visual Recognition service.
Machine learning: the main problem is the hype around machine learning... we are far from machine learning, AI and robots seeing and taking over the world... What we have is massive data dumped onto massive computational hardware to obtain some clues...
TensorFlow: with a Core i7, no NVIDIA GPU, and a custom compilation for Windows x64 with the AVX2 extension enabled (not available in all CPUs: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions), the quickest model (less accurate) took 3.5 seconds and a more accurate model took 45 seconds to classify/detect objects in a single photo.
Unless I'm doing something wrong, getting lower latency requires a decent GPU and/or a different model.
I'll post more details in another thread.
Have you considered building a plugin that expands the capability of EZ-Robot's existing object training? It's much faster than using TensorFlow or YOLO on Darknet. It's built using the Yeremeyev method: http://edv-detail.narod.ru/AVM_main.html
I ordered these ones from amazon.ca; I think they are the same. I don't want to get hit with customs and duties for cross-border shipping. https://www.amazon.ca/gp/offer-listing/B00001QHVP/ref=olp_twister_child?ie=UTF8&mv_style_name=1&qid=1522716936&sr=8-4
Conversations/Assistant: WOW, this is cool. Amazing, thanks @ptp. I did a quick voice to text => Assistant => text to voice. It works well despite my strong accent. I can see this being very valuable when controlling and interacting with a robot using voice recognition and general conversation.
Found a small problem: in the Car example, under dialog, there was an "anything else" (true) node whose response was: I am not sure about that, you can "Turn on my lights". The quotation marks caused the script to crash, so I just deleted that dialog response as it was causing errors. All good.
@Nink, Fixed. EZ-Script does not like double quotes; they are now replaced with a single quote ( ' ).
One simple example: an init script (run before the Watson Assistant) and an Assistant script, roughly as sketched below.
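(The variable names, intent name and the radio example are illustrative; check the plugin for the exact variable names it assigns.)

# Init script: run once before starting the Watson Assistant
$RadioOn = 0
$LightsOn = 0

# Assistant script: runs each time a response comes back from the Assistant
# ($WatsonAssistantText and $WatsonAssistantIntent are illustrative names)
SayEZB($WatsonAssistantText)
if ($WatsonAssistantIntent = "turn_on_radio")
  $RadioOn = 1
endif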
I called him Jarvis.
Listen to what happens when you try to switch on the radio when the radio is already on.
To help with testing, you can replay a previous assistant message response:
1522815312 is a previous CaptureId.
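A sketch of what the replay call could look like (the command name is illustrative):

# Replay a previously captured assistant response for testing
ControlCommand("IBM Watson Services", ReplayAssistantResponse, 1522815312)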
Thanks @PTP; sorry, this was my bad. The problem was the person who included quotation marks in a programmed response. This was neither your issue nor EZ-Robot's. What were they thinking...
I like the new previous response test variable as I get lost in where we are in the conversation. Are we in the middle of a conversation or starting a new one?
Wanna know the best option? A Bluetooth earpiece; they're cheap and work really well. eBay or Amazon for a few dollars and voila.
Just like the ones that cheesy real estate people wear for their phone. It connects to your PC.
I was actually going to use a Bluetooth handheld microphone. I had a lot of trouble with NAOs in the past doing demos in a crowded room. The robots' built-in mics pick up all the background noise, so in a demo they don't hear the commands. With a handheld mic you can pass it from person to person, and the robot only hears the person speaking.
PTP did some amazing work on allowing you to configure dB levels and delays in the voice capture plugin he wrote, so this makes it really easy to tune for the appropriate responses.
Yes, the mics in the NAO are horrendous. One of the more well-known promotional images from the ALIZ-E project (fp7alizeproject.files.wordpress.com/2017/10/luther-nao-and-elias.jpg?w=658&h=439) features a child whispering into the "ear" of the robot. Basically, this is also the only way to get it to recognize anything without using external mics.
@ptp I am not sure what I am doing wrong: the variables in V7 only get assigned the first time, and then I am not getting any updates to the variables when I use the plugin. Maybe it is me? Not sure. If I manually set the variables they work fine.
@Nink,
Which service/script is not assigning the variables?
It was all of the services. I was using the Assistant at the time (move forward, move backward, robot dance, etc.) and it suddenly stopped working. Closing ARC did not fix it; the variables would assign the first time but not the second. ARC also kept hanging, so I had to restart it several times (killing the process, as I could not exit the app). If I assigned a variable from a script it worked fine, but not via the plugin. I must admit I did not test with other plugins (only Watson was running).
Eventually I restarted my Windows VM and the problem went away. 8 GB is assigned to Windows, with only ARC running in the VM (Mac Parallels). I will monitor it today. Maybe I will try running Windows 10 and ARC natively on a bare-metal PC.
It seems to be behaving for me today, so hopefully it was just my PC. One other question: occasionally I send no data to Watson when using Speech to Text (background noise, etc.), but it causes the Speech to Text to stop.
Here is a quick Watson Assistant project I started writing. Sorry, I'm learning both EZ-Script and Watson Assistant, so it is really rough, but it will give everyone an idea. I gave him a bad attitude because he never listens to me. I need to go back and do a lot more work to record what position he is currently in, etc., and take multiple instructions, but if others want to work on it together that would be great. This is for the JD robot.
Call this from the Speech to Text plugin, before speech to text.
And this after speech to text.
Put this in the Watson Assistant script in the plugin (sketched below).
Import robotest.json into Watson Assistant.
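A rough sketch of the dispatch part of such an Assistant script (the Watson variable name and the Auto Position action names are illustrative):

# Dispatch a robot action based on the intent returned by Watson Assistant
if ($WatsonAssistantIntent = "move_forward")
  ControlCommand("Auto Position", AutoPositionAction, "Forward")
elseif ($WatsonAssistantIntent = "move_backward")
  ControlCommand("Auto Position", AutoPositionAction, "Reverse")
elseif ($WatsonAssistantIntent = "dance")
  ControlCommand("Auto Position", AutoPositionAction, "Dance")
endif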
Regarding the microphone subject (quality):
Has anyone tested the Kinect microphone array?
It can't be added to a small robot... and sooner or later they will stop selling it...
What model of Kinect do you have? I have an old Kinect 360 missing a power supply; I can hack something together and test. Hopefully the OpenKinect or PrimeSense Mate NI drivers still work with Windows 10. I use a PS3 Eye on my desktop and it works really well with voice recognition (great mic), although it would be good to get a Kinect working, especially if we get a Unity plugin.
I have a Kinect 2, but I'm having trouble getting the microphone array to pick up sound in Windows. I'll look into it today and test the recognition if I can get it to work.
Kinect: I'm curious how good the Kinect is in a noisy/open environment.
@Nink: I have a PS3 Eye, Asus Xtion, Kinect 1 and Kinect 2. All of them have microphone arrays, and the Kinect has a sound localization API that one can use to turn the robot's head towards the sound/person.
Does the PS3 Eye work at long distances and/or with environmental noise?
I'm evaluating a few microphone arrays, and one of the cheapest solutions is to use the PS3 Eye with a Raspberry Pi Zero and forward the sound to the PC (Wi-Fi microphone).
Post #19 https://synthiam.com/Community/Questions/10781&page=2
Not sure about background noise, but the PS3 Eye is good from a distance (nothing works well with background noise). I think we need to just do a launch name (like "Hey Siri", "OK Google" or "Echo", etc.) using the EZ-B voice recognition and hope for the best.
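A sketch of that pattern: bind the launch phrase to a script in ARC's Speech Recognition control and have that script start the Watson capture (the Watson command name is illustrative):

# Script bound to the launch phrase (e.g. "robot listen") in the Speech Recognition control
Say("Listening")
ControlCommand("IBM Watson Services", StartListening)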
This doesn't solve the depth-sensing issue though.
I am happy to go 2-for-1 on a depth sensor (I will buy 2 and send you one) if you want to work on something. I have been waiting for EZ-Robot to provide a LIDAR to do SLAM. That, in conjunction with the Unity work going on, would be exciting. Maybe we should just look at getting a couple of these for now: https://click.intel.com/realsense.html
Off topic or on topic (not sure any more): I have my latest photos of the Bicycle cards, Ace to 6. I can send you a link to the cards offline since you own a deck, but the results still are not good. I think I need to string together multiple AI searches, but the time delay is an issue, and Watson Visual Recognition does not seem to support a VR pipeline or linked requests, so I have to call multiple VR requests from a single EZ-Script based on the previous VR outcome, which takes A LONG TIME: first find the suit (Hearts, Diamonds, Clubs, Spades); once the suit is derived, determine picture or number card; then the actual card (number within the suit). And it still gets it wrong. Maybe I'll work on it this weekend if I have time.
@all I stumbled upon an interesting project lately. I do not know if it is of any value for you guys, but it might be worth checking out...
http://lisajamhoury.com/portfolio/kinectron/
https://kinectron.github.io/docs/intro.html
https://github.com/kinectron/kinectron
Have you guys considered this as a mic option?
https://www.seeedstudio.com/ReSpeaker-4-Mic-Array-for-Raspberry-Pi-p-2941.html
"this 4-Mics version provides a super cool LED ring, which contains 12 APA102 programmable LEDs. With that 4 microphones and the LED ring, Raspberry Pi would have ability to do VAD(Voice Activity Detection), estimate DOA(Direction of Arrival) and show the direction via LED ring, just like Amazon Echo or Google Home"
Hi @ptp,
Thank you for your work on the IBM Watson plugin. I tried to play with it a little bit, and it seems that IBM Watson now only provides an API key and URL for new accounts, so there are no more username/password credentials.
https://console.bluemix.net/docs/services/watson/getting-started-tokens.html#tokens-for-authentication:
This link also mentions the migration: https://www.ibm.com/watson/developercloud/text-to-speech/api/v1/curl.html?curl#introduction
I could only try the visual recognition, and I get the following error: IBM.WatsonDeveloperCloud.Http.Exceptions.ServiceResponseException: The API query failed with status code Unauthorized: ...
I'm not sure if it's my bad, or if some big changes were made to the Watson .NET SDK:
Hi @Aryus96, I will send you some credentials for my old account offline. It seems to still work.
@Nink,
I'm investigating... my services are working.
Do you know how to convert an existing service (user/password) to the new authentication method?
I can't find an upgrade option.
I also cannot find a way to generate basic authentication (username/password) credentials with my new account.
@ptp I'm also working on a personal plugin that includes the Watson .NET SDK, using version 2.11.0, and I'm having some difficulties with the System.Net.Http dependencies... Did you encounter similar problems?
I deleted the existing service and created a new service. Although the London location is not available, the new service is located in Washington, DC.
I created a new service on my personal account and I was issued a token. I didn't want to touch my old services as they still work.
I'll update the plugin to support both the user/password and the API key/token methods. Tokens seem more secure in advanced scenarios where a proxy service creates and delivers time-limited tokens; if an attacker gets a token it is not a major issue, as it has a time limit and is not the service's API key.
@ptp No, it doesn't work for me. It's not because of the token but because of the versions of the SDK and dependencies for .NET 4.6.1. As you succeeded in making it run, you might be the only one who can help me. What versions of the SDK and dependencies are you using? I opened an issue here with more details: https://github.com/watson-developer-cloud/dotnet-standard-sdk/issues/307
@aryus96,
I tested with a .NET Core console project.
I'll check with a .NET Framework project.
@aryus96:
I found the issue, and it is working with a Windows Forms .NET 4.6.1 project. I'll redo it from zero to confirm the fix, and I'll update the open Watson ticket.
@aryus96: I've updated your ticket.
@ptp Wow, thank you very much! I will check that out!
Hi @ptp,
I'm starting to understand what's going on.
The TextToSpeech 2.11.0 package doesn't work with any version of System.Net.Http other than 4.0.0.
When I run your code, I have to edit the app.config and uncomment the dependentAssembly entry for System.Net.Http; otherwise it loads 4.1.0.
If I do that, I can run your code without any problem and it works great.
But as I want to integrate it into ARC, I'm changing the project to a Class Library project rather than a Windows one. I add the debugger option, compiler settings, plugin.xml, path, etc.
I can load the plugin in ARC; however, it loads System.Net.Http 4.1.0 even though the app.config is set up correctly. Does that mean ARC is using its own System.Net.Http and overriding the plugin's version (4.0.0)?