Canada
Asked
Resolved by DJ Sures!

Using OpenAI Skill With Other Products

OK, great. I noticed GPT-4o essentially says "give me an image in any format and I'll work it out," whereas everyone else wants base64.

from openai import OpenAI

client = OpenAI()

# GPT-4o accepts a plain image URL; most other models expect base64-encoded data.
response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            "detail": "high"
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)
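
For the models that insist on base64, the same call works if you inline the image as a data URL. A minimal sketch, assuming a local file (the path is just a placeholder):

import base64
from openai import OpenAI

client = OpenAI()

# Read a local image and base64-encode it (path is a placeholder).
with open("camera_frame.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # Note the lowercase MIME type in the data URL.
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)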


PRO
Canada
#9  

Thanks DJ. It no longer gives me an image format error on LM Studio with the Phi-3 Vision model. If I reset everything, it works after a few attempts, with some messed-up responses at first, but I think that may be the model/tool at my end (I need to look into this, but it works fine locally). It did successfully describe the image a couple of times, so this is a great start.
It did work with openrouter.ai with various models. OpenRouter is a model-routing service that uses the OpenAI format, so it should work with all the vision models they host. I tested Google Gemini via OpenRouter and we can use that engine (it won't work direct with Gemini, as Google doesn't support the OpenAI format). It also worked with Claude vision via OpenRouter; the model name I used was anthropic/claude-3-opus. I could not get that one to work directly either (maybe config issues, as I've never used it via API before).

I will try some other models directly, or maybe try LiteLLM. It is an open-source OpenAI API gateway, so it is good at cleaning up compatibility issues.

If anyone wants to try images with hosted models, here are my OpenRouter settings. All you have to change is the model name for the service or model you are using. (It also works with OpenAI GPT-4o, so you only have one bill.)

User-inserted image
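
If you'd rather do the same thing from Python instead of the skill's settings dialog, the standard OpenAI client just needs its base URL pointed at OpenRouter. A minimal sketch, assuming an OPENROUTER_API_KEY environment variable and the anthropic/claude-3-opus model mentioned above:

import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; only the base URL,
# API key, and model name change.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="anthropic/claude-3-opus",  # swap in any hosted vision model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image briefly."},
                {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)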

PRO
Canada
#10  

I increased the image size on the camera from 640x480 to 1920x1080 and increased the context length to 8172 so it would have lots of RAM to play with. I lowered the temperature a bit and tweaked a few other settings. Things are working much better. I guess I need to clean up the room.

{ "index": 0, "message": { "role": "assistant", "content": " room. The room is messy with clothes and blankets on the bed, a white wall in the background, a doorway to the left of the bed, and a person sitting on the edge of the bed." }, "finish_reason": "stop" }

PRO
Synthiam
#11  

That's so weird that lower case worked. The documentation says JPEG in capitals, but the content type is usually specified in lowercase. Oh well, this is a new industry, so I guess we have to give them slack for poor coding :)

PRO
Canada
#12  

Well, when I say it's working, I mean it works if you keep sending images, ignore the garbage, and eventually you get one back that describes what it sees. This has to be on my end though, as it works all the time with the various paid services I have tried now.

I'll keep trying other local vision models. It would be great if I could send an image every second, since apart from electricity it's free (with ARCx maybe even stream, since it's so much faster). That way I can analyze what I get back, perform actions when it sees items it is triggered on, and build a table of items it sees in conjunction with location information.

Has anyone seen my wallet, my keys, my phone, the remote? Yeah, you left it on the kitchen table, the BBQ, etc.
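
A minimal sketch of that capture-describe-log loop against a local OpenAI-compatible server. The http://localhost:1234/v1 endpoint is LM Studio's usual default; the file path, model name, and location label are placeholders:

import base64
import time
from openai import OpenAI

# Local OpenAI-compatible server (LM Studio's default port; adjust to taste).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

seen = []  # running table of (timestamp, location, description)

while True:
    # Placeholder: in ARC this frame would come from the camera skill.
    with open("camera_frame.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="local-vision-model",  # whatever model is loaded locally
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "List the objects you see, very briefly."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
        max_tokens=100,
    )

    description = response.choices[0].message.content
    seen.append((time.time(), "kitchen", description))  # location label is a placeholder
    print(description)

    time.sleep(1)  # roughly one frame per second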

My wife likes the idea as well, it being local on my computer versus the cloud. She walked into the room when I was playing with OpenAI, and it described her and everything around her; she was not impressed that it had all this data.

PRO
Canada
#13  

I tried a few more local models. The model from Microsoft, Phi-3, is the one that gave me most of the problems, so don't use it (I went with it because it actually came from a reputable company, but it turns out it sucks big time).

So far, this one seemed to work the best for vision with ARC and LM Studio locally: Lewdiculous/Eris_PrimeV4.69-Vision-32k-7B-GGUF-Imatrix. It is meant for role-play games, so don't have a conversation with it; it will drive you crazy. Create a prompt that says to keep the description VERY SHORT, and maybe limit tokens so it doesn't waffle on. https://huggingface.co/Lewdiculous/Eris_PrimeV4.69-Vision-32k-7B-GGUF-Imatrix

This should run on any modern video card with 16GB or more of VRAM, for example an Nvidia RTX 3080/3090/4080/4090 or AMD RX 6800 XT/6900 XT/7900 XT, and will give you results in under 1 second.

PRO
Synthiam
#14   — Edited

It does stream; that's how all data is transferred. If you want to run it in a loop, do so. You don't need to handle processing in the action script; use the response variable inline after the control command.

if you want the robot to move based on image data, tell the ai.

write to the ai in the description that it is an arm. Describe the lengths. Describe the servo positions. Describe the servo ports. And provide the current servo values.

and ask the ai to return a list of new servo values. Parse the list and move the servos.

remember - you don't need anything special - you're in charge of what the AI does by telling it. That's what is great about LLMs. You don't have to be a programmer - you just have to be good at explaining tasks.
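
As a rough illustration of that idea, here is a sketch of building such a prompt in Python. The arm dimensions, servo ports, and current positions are all made-up example values, and PORT=POSITION is just one output convention you could ask for:

# Example state for a hypothetical 3-servo arm (all values are placeholders).
servos = {"D0": 90, "D1": 45, "D2": 120}          # port -> current position
link_lengths_mm = {"shoulder": 120, "elbow": 90}  # segment lengths

prompt = (
    "You control a robot arm. "
    f"The shoulder segment is {link_lengths_mm['shoulder']}mm and the elbow segment is "
    f"{link_lengths_mm['elbow']}mm. Servos D0 (base), D1 (shoulder) and D2 (elbow) accept "
    "positions from 1 to 180. "
    "Current positions: " + ", ".join(f"{p}={v}" for p, v in servos.items()) + ". "
    "Move the gripper toward the red cup in the attached image. "
    "Reply ONLY with a comma-separated list of PORT=POSITION values."
)

print(prompt)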

PRO
Canada
#15  

Sadly, my local LLMs are nowhere near as smart as GPT-4o or Google DeepMind's PaLM-E at local robot control. They are lucky if they know what a servo is, let alone how to calculate positions and servo values to perform inverse kinematics. The good news is that you can fine-tune existing models like Llama 2/3 and, given enough data, teach them about robotics, sensors, actuators, etc. I guess if I train one on all of these papers and the Synthiam documentation, it should make a good starting point. https://github.com/GT-RIPL/Awesome-LLM-Robotics

BTW, is there an "ARC User Manual" in PDF format? I could scrape the data from the website, I guess, but if you have a nice collection of training data it would be helpful.

PRO
Synthiam
#16  

I don't believe fine-tuning is the terminology you're looking to use. Embedding would be more appropriate.

  • Fine-tuning adjusts the weights of an existing model with additional training examples
  • Embedding turns your documents into vectors so the relevant passages can be retrieved and added to the prompt as context
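
A minimal sketch of what that embedding approach could look like with the OpenAI client. The documentation chunks are placeholders, and text-embedding-3-small is just one embedding model you could use:

import numpy as np
from openai import OpenAI

client = OpenAI()

# Example documentation chunks (placeholders for scraped ARC manual text).
docs = [
    "ControlCommand() sends a command to another robot skill.",
    "Servo positions in ARC range from 1 to 180 degrees.",
    "The camera skill publishes frames that other skills can consume.",
]

def embed(texts):
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data])

doc_vectors = embed(docs)

question = "What range of values can a servo take?"
q_vector = embed([question])[0]

# Cosine similarity to pick the most relevant chunk to prepend to the prompt.
scores = doc_vectors @ q_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector)
)
best_chunk = docs[int(np.argmax(scores))]

print("Context to add to the prompt:", best_chunk)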

The great thing about LLMs is that, as a programmer, you don't need to program :). You talk to it as if you're talking to a person. So if you want it to do something, you provide the details in verbal instructions. You must also specify the response you expect. The LLM list you provided looks like it has several attempts you can play with.

Parsing the response will be your only "programming" effort, because you're most likely going to ask for the servo responses in a comma-separated list of PORT=POSITION.
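
A small sketch of that parsing step, assuming the model replied with something like "D0=90, D1=45, D2=120". The final print is a stand-in for however you actually move servos in your setup:

reply = "D0=90, D1=45, D2=120"  # example model output

positions = {}
for pair in reply.split(","):
    port, value = pair.strip().split("=")
    positions[port] = int(value)

for port, value in positions.items():
    # Placeholder: replace with the actual servo command in your environment,
    # e.g. the appropriate ARC script call for your board.
    print(f"move {port} to {value}")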