Canada
Asked
Resolved by DJ Sures!

Using OpenAI Skill With Other Products

OK, great. I noticed GPT4o says, in effect, "give me any image in any format and I will work it out," whereas everyone else wants base64.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            "detail": "high"
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)




PRO
Canada
#1  

Curious: is the image sent as a JPG, PNG, etc., or is it converted to base64 and sent as text? It looks like LM Studio will only take a photo in base64 format, whereas GPT4o will take a PNG or JPG.

PRO
Synthiam
#2   — Edited

The JPEG binary is encoded to ASCII via base64 (per the OpenAI specification).

#3  

That example is for a URL, which requires a web server, and you do not have one. If you hosted a web server on the internet with images, you could use that example. Instead, the proper usage is to base64-encode the binary and include it with the message.
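To make that last point concrete, here is a minimal Python sketch of base64-encoding image bytes and including them in the message as a data URL. The helper name `image_message` and the placeholder bytes are assumptions for illustration; the message shape follows the Python example at the top of the thread.

```python
import base64
import json


def image_message(image_bytes: bytes, prompt: str, image_type: str = "jpeg") -> dict:
    """Build a user message whose image travels inline as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/{image_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real camera frame (hypothetical data).
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image"
message = image_message(fake_jpeg, "Describe this image")
print(json.dumps(message, indent=2))
```

No web server is involved: the image is just text inside the JSON body.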

Additionally, the message json is assembled by open ai’s api. The message is not formatted and created by the robot skill, as it’s using their sdk api for their standard. Because the message works with open ai, we can assume the third party system that you’re using has issues.

PRO
Synthiam
#4  

I think this conversation is starting to get off topic as it's about third party products using the same open ai protocol. I'll make a new thread for it

PRO
Synthiam
#5  

Okay here we go.... Let me see. This is how the image is sent using the SDK for the Open AI API...

        using (var api = new OpenAIService(aiOptions)) {

          var chat = new ChatCompletionCreateRequest();
          chat.Messages = new List<ChatMessage>();

          chat.Messages.Add(new ChatMessage() {
            Role = "user",
            Contents = new List<MessageContent>() {
               OpenAI.ObjectModels.RequestModels.MessageContent.ImageBinaryContent(_cameraImage, "JPEG")
            }
          });

          chat.Temperature = Convert.ToInt32(_cf.STORAGE[ConfigTitles.SETTING_TEMPERATURE]) / 10f;

          if (_cf.STORAGE[ConfigTitles.MODEL].ToString().StartsWith("other", StringComparison.InvariantCultureIgnoreCase))
            chat.Model = _cf.STORAGE[ConfigTitles.MODEL_OTHER].ToString();
          else
            chat.Model = _cf.STORAGE[ConfigTitles.MODEL].ToString();

          // Setup system message
          // -------------------------------------------------------------------------        
          if (string.IsNullOrWhiteSpace(requestStr)) {

            chat.Messages.Add(ChatMessage.FromSystem("Describe this image"));
          } else {

            chat.Messages.Add(ChatMessage.FromSystem(requestStr));
          }

          // Send open ai message and get response
          // -------------------------------------------------------------------------
          var responsePhrase = await api.ChatCompletion.CreateCompletion(chat);

Like synthiam support says - there's no way the JSON is "Created manually by the robot skill". The API has a specification for the JSON format, and the SDK fulfills that specification; both are by Open AI. The output of the SDK will be a formatted document that the Open AI API requires.

If you're using a third-party product that claims to be compatible with Open AI, I'd challenge them that something isn't compatible.

PRO
Canada
#6  

OK, thanks. I don't know C#, but looking at the code, it appears to send this as a binary-encoded image and not as the base64 image that my tool wants to receive.

OpenAI.ObjectModels.RequestModels.MessageContent.ImageBinaryContent(_cameraImage, "JPEG")

OpenAI Python Example Base64

 return base64.b64encode(image_file.read()).decode('utf-8')

https://platform.openai.com/docs/guides/vision

PRO
Synthiam
#7  

No - it means taking an IMAGE in BINARY FORMAT. It's essentially the same command that your Python is showing. Python is a different language, so the commands will be different. Also, it appears that the Python isn't using an OpenAI SDK for the API.

This is the OPEN AI command code you're asking about.

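    /// <summary>
    ///    Static helper method to create MessageContent from binary image
    ///    OpenAI currently supports PNG, JPEG, WEBP, and non-animated GIF
    /// </summary>
    /// <param name="binaryImage">The image binary data as byte array</param>
    /// <param name="imageType">The type of image</param>
    /// <param name="detail">The detail property</param>
    /// <returns></returns>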
    public static MessageContent ImageBinaryContent(
        byte[] binaryImage,
        string imageType,
        string? detail = "auto"
    )
    {
        return new()
        {
            Type = "image_url",
            ImageUrl = new()
            {
                Url = string.Format(
                    "data:image/{0};base64,{1}",
                    imageType,
                    Convert.ToBase64String(binaryImage)
                ),
                Detail = detail
            }
        };
    }
PRO
Synthiam
#8   — Edited

I asked for the OpenAI robot skill to be updated. I noticed it was using uppercase JPEG, and it should be lowercase, although that shouldn't matter. But maybe it does for your open source thing.

I.e., it was

"data:image/JPEG;base64,{1}",

and is now

"data:image/jpeg;base64,{1}",

shrug
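For anyone building the data URL themselves, here is a small Python sketch of normalizing the image type before formatting it. The alias table and the function name `to_data_url` are hypothetical and not part of the robot skill or the SDK; the point is only that `JPEG` and `jpg` both end up as the lowercase `jpeg` subtype.

```python
import base64

# Hypothetical alias table: common labels mapped to a lowercase MIME subtype.
_SUBTYPE_ALIASES = {
    "jpg": "jpeg",
    "jpeg": "jpeg",
    "png": "png",
    "webp": "webp",
    "gif": "gif",
}


def to_data_url(image_bytes: bytes, image_type: str) -> str:
    """Format image bytes as a data URL with a lowercase image subtype."""
    subtype = _SUBTYPE_ALIASES.get(image_type.strip().lower())
    if subtype is None:
        raise ValueError(f"unsupported image type: {image_type!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{subtype};base64,{encoded}"


print(to_data_url(b"abc", "JPEG"))  # data:image/jpeg;base64,YWJj
```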

PRO
Canada
#9  

Thanks DJ. It no longer gives me an image format error on LM Studio with the Phi3 Vision model. If I reset everything, it works after a few attempts, with some messed-up responses initially, but I think that may be the model or tool at my end (I need to look into this, but it works fine locally). It did successfully describe the image a couple of times, so this is a great start.
It also worked with openrouter.ai with various models. This is a model routing engine that uses the OpenAI format, so it should work with all the vision models they have. I tested with Google Gemini via OpenRouter, so we can use that engine (it won't work directly with Gemini, as they don't support the OpenAI format). It also worked with Claude with vision via OpenRouter; the model name I used is anthropic/claude-3-opus. I could not get that one to work directly either (maybe config issues, as I have never used it via the API before).

I will try some other models direct or maybe try LiteLLM. It is an open source OpenAI API gateway so it is good at cleaning up compatibility issues.

If anyone wants to try images with hosted models, here are my OpenRouter settings. All you have to change is the model name for the service or model you are using. (It also works with OpenAI GPT4o, so you only have one bill.)

User-inserted image
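The same idea can be sketched in Python: the request stays in the OpenAI format, and only the base URL, key, and model name change. The two base URLs below are my assumption of the usual defaults for OpenRouter and a local LM Studio server, and `build_chat_request` is a hypothetical helper, so check them against your own settings. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Assumed base URLs; verify against each provider's documentation.
ENDPOINTS = {
    "openrouter": "https://openrouter.ai/api/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(provider, api_key, model, messages):
    """Build (but do not send) an OpenAI-format chat completion request."""
    body = json.dumps(
        {"model": model, "messages": messages, "max_tokens": 300}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{ENDPOINTS[provider]}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "openrouter",
    "YOUR_API_KEY",
    "anthropic/claude-3-opus",
    [{"role": "user", "content": "Describe this image"}],
)
print(request.full_url)
# urllib.request.urlopen(request) would perform the actual call.
```

Switching to a local model is then a matter of passing `"lmstudio"` and whatever model name the local server reports.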

PRO
Canada
#10  

I increased the image size on the camera from 640 x 480 to 1920 x 1080 and increased the content length to 8172 so it would have lots of RAM to play with. I lowered the temp a bit and tweaked a few other settings. Things are working much better. I guess I need to clean up the room.

{ "index": 0, "message": { "role": "assistant", "content": " room. The room is messy with clothes and blankets on the bed, a white wall in the background, a doorway to the left of the bed, and a person sitting on the edge of the bed." }, "finish_reason": "stop" }
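A choice object like the one above can be unpacked with a few lines of Python. This is a sketch only: the JSON here is shortened from the response above, and `extract_description` is a hypothetical helper name.

```python
import json

# Shortened copy of the choice object shown above.
choice_json = (
    '{ "index": 0, "message": { "role": "assistant", '
    '"content": " room. The room is messy." }, "finish_reason": "stop" }'
)


def extract_description(choice: dict) -> str:
    """Return the assistant text, or "" if generation did not stop cleanly."""
    if choice.get("finish_reason") != "stop":
        return ""
    return choice["message"]["content"].strip()


print(extract_description(json.loads(choice_json)))  # room. The room is messy.
```

Checking `finish_reason` is one cheap way to skip replies that were cut off by the token limit.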

PRO
Synthiam
#11  

That’s so weird the lower case worked. The documentation says JPEG in capital. But usually specifying the content type is lowercase. Oh well. This is a new industry so I guess we have to give them slack for poor coding:)

PRO
Canada
#12  

Well when I say it’s working I mean it works if you keep sending images and ignore the garbage and eventually you get one back that describes what it sees.  This has to be my end though as it works all the time with the various paid services I have tried now.

I’ll keep trying other local vision models. It would be great if I could send an image every second since, apart from electricity, it’s free (with ARCx maybe even stream, since it’s so much faster). This way I can analyze what I get back, perform actions when it sees items that are set up as triggers, and build a table of the items it sees in conjunction with location information.

Has anyone seen my wallet, my keys, my phone, the remote? Yeah, you left it on the kitchen table, the BBQ, etc.
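The table idea could start as simply as this Python sketch. It assumes the vision model has been prompted to reply with a comma-separated list of item names and that the robot already knows its current location; `record_sighting` and `where_is` are hypothetical names.

```python
import time

# item name -> (location, unix time of the most recent sighting)
last_seen = {}


def record_sighting(model_reply, location, when=None):
    """Store every item named in a comma-separated model reply."""
    stamp = time.time() if when is None else when
    for name in model_reply.split(","):
        item = name.strip().lower()
        if item:
            last_seen[item] = (location, stamp)


def where_is(item):
    """Return the last known location of an item, or None if never seen."""
    entry = last_seen.get(item.strip().lower())
    return entry[0] if entry else None


record_sighting("wallet, keys", "kitchen table")
record_sighting("Remote", "BBQ")
print(where_is("wallet"))  # kitchen table
print(where_is("phone"))   # None
```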

My wife also likes the idea of it being local on my computer versus in the cloud. She walked into the room when I was playing with OpenAI, and it described her and everything around her, and she was not impressed that it had all this data.

PRO
Canada
#13  

I tried a few more local models. Microsoft's Phi 3 model is the one that gave me most of the problems, so don't use it. (I went with this one because it actually came from a reputable company, but it turns out it sucks big time.)

So far, this one seemed to work the best for vision with ARC and LM Studio locally: Lewdiculous/Eris_PrimeV4.69-Vision-32k-7B-GGUF-Imatrix. It is meant for role-play games, so don't have a conversation with it; it will drive you crazy. Create a prompt that says to keep the description VERY SHORT, and maybe limit tokens so it doesn't waffle on. https://huggingface.co/Lewdiculous/Eris_PrimeV4.69-Vision-32k-7B-GGUF-Imatrix

This should run on any modern video card with 16GB or more of VRAM, for example Nvidia RTX 3080/3090/4080/4090 and AMD RX 6800XT/6900XT/7900XT, and will give you results in < 1 second.

PRO
Synthiam
#14   — Edited

It does stream. That’s how all data is transferred. If you want to run it in a loop, do so. You don’t need to handle processing in the action script. Use the variable response inline after the control command.

if you want the robot to move based on image data, tell the ai.

write to the ai in the description that it is an arm. Describe the lengths. Describe the servo positions. Describe the servo ports. And provide the current servo values.

and ask the ai to return a list of new servo values. Parse the list and move the servos.
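The steps above can be sketched as a prompt builder in Python. Everything specific here is an assumption for illustration: the port names, segment names, lengths, positions, and the requested reply format.

```python
def build_arm_prompt(segment_lengths_mm, servo_positions, goal):
    """Describe the arm and its current state, then ask for new servo values."""
    lengths = ", ".join(f"{name}={mm}" for name, mm in segment_lengths_mm.items())
    current = ", ".join(f"{port}={deg}" for port, deg in servo_positions.items())
    return "\n".join(
        [
            "You control a robot arm.",
            f"Segment lengths (mm): {lengths}",
            f"Current servo positions (degrees): {current}",
            f"Task: {goal}",
            "Reply with only a comma-separated list of PORT=POSITION pairs.",
        ]
    )


prompt = build_arm_prompt(
    {"upper_arm": 120, "forearm": 100},  # hypothetical arm dimensions
    {"D0": 90, "D1": 45},                # hypothetical ports and positions
    "Reach toward the object in the image.",
)
print(prompt)
```

Asking for a strict reply format up front keeps the parsing step on the other side short.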

remember - you don’t need anything special - you’re in charge of what the ai does by telling it. That’s what is great about llm’s. You don’t have to be a programmer - you just have to be good at explaining tasks.

PRO
Canada
#15  

Sadly, my local LLMs are nowhere near as smart as GPT4o or Google DeepMind's PaLM-E at local robot control. They are lucky if they know what a servo is, let alone how to calculate positions and servo values to perform inverse kinematics. The good news is that you can fine-tune existing models like Llama 2/3 and, given enough data, teach them about robotics, sensors, actuators, etc. I guess if I train it on all of these papers and the Synthiam documentation, it should make a good starting point. https://github.com/GT-RIPL/Awesome-LLM-Robotics

BTW, is there an "ARC User Manual" in PDF format? I could scrape the data from the website, I guess, but a nice collection of training data would be helpful if you have one.

PRO
Synthiam
#16  

I don't believe fine-tuning is the terminology you're looking to use. Embedding would be more appropriate.

  • Fine-tuning is prioritizing vectors of an existing model
  • Embedding is adding vectors to an existing model

The great thing about an LLM is that, as a programmer, you don't need to program:). You talk to it as if you're talking to a person. So if you want it to do something, you provide the details in verbal instructions. You must also specify the response you expect. The LLM list you provided looks like it has several attempts that you can play with.

Parsing the response will be your only "programming" effort, because you're most likely going to ask for the servo responses as a comma-separated list of PORT=POSITION.
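That parsing step might look like the following Python sketch. The function name `parse_servo_reply` and the 1 to 180 position limits are assumptions; adjust the limits to whatever your servos actually accept.

```python
def parse_servo_reply(reply, lo=1, hi=180):
    """Parse "PORT=POSITION,PORT=POSITION" into a {port: position} dict."""
    positions = {}
    for part in reply.split(","):
        part = part.strip()
        if not part:
            continue
        port, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected PORT=POSITION, got {part!r}")
        position = int(value.strip())
        if not lo <= position <= hi:
            raise ValueError(f"position out of range for {port.strip()}: {position}")
        positions[port.strip().upper()] = position
    return positions


print(parse_servo_reply("D0=90, d1=45,D2=120"))  # {'D0': 90, 'D1': 45, 'D2': 120}
```

Rejecting malformed or out-of-range values before moving anything matters here, since the model can and sometimes will ignore the requested format.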

PRO
Canada
#17  

I really need to do both. @athena, can you provide an overview of the process and advantages of first embedding robotics and ARC domain-specific data into an existing model like llama-3 and then fine-tuning the model on this new embedded data?

PRO
Canada
#18  

Certainly! Embedding and fine-tuning are two critical processes in enhancing the capabilities of a pre-trained language model like LLaMA-3 for specific tasks, such as robotics and ARC (Autonomous Robot Control). Here's an overview of both processes and their advantages:

Embedding Robotics and ARC Domain Data

1. Embedding: Embedding involves adding new information to the existing model without altering its core structure. This process helps the model understand and generate responses related to specific domains, such as robotics and ARC.

Process:

  • Data Collection: Gather a comprehensive dataset related to robotics and ARC. This can include research papers, technical documentation, tutorials, and user manuals.
  • Preprocessing: Clean and preprocess the data to ensure it is in a format suitable for embedding. This may involve tokenization, normalization, and removing irrelevant information.
  • Embedding: Use the preprocessed data to create embeddings that represent the new information. These embeddings are then integrated into the existing model.

Advantages:

  • Domain-Specific Knowledge: Embedding allows the model to understand and generate responses related to specific domains without altering its core structure.
  • Efficiency: Embedding is generally faster and requires less computational resources compared to fine-tuning.
  • Flexibility: You can embed multiple domains into a single model, making it versatile for various applications.
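One common way to realize the embedding step, offered here as an assumption rather than as what any particular product does, is to keep the vectors in an index next to the model and retrieve the closest text at query time, leaving the model's weights untouched. The Python sketch below uses a toy bag-of-words vector as a stand-in for a real embedding model, and the two documents are made-up sample text.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical domain snippets, embedded once and kept in a simple index.
documents = [
    "A servo moves to a position in degrees when given a port and a value.",
    "The camera device captures frames that can be sent to a vision model.",
]
index = [(doc, embed(doc)) for doc in documents]


def retrieve(question):
    """Return the indexed document most similar to the question."""
    query = embed(question)
    return max(index, key=lambda pair: cosine(query, pair[1]))[0]


question = "How do I set a servo position?"
context = retrieve(question)
prompt = f"Use this reference:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The retrieved text is simply placed in the prompt, which is why this route needs far less compute than retraining.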

Fine-Tuning the Model on Embedded Data

2. Fine-Tuning: Fine-tuning involves adjusting the weights of the pre-trained model based on the new embedded data. This process helps the model generate more accurate and contextually relevant responses for specific tasks.

Process:

  • Data Preparation: Use the embedded data to create a fine-tuning dataset. This dataset should include input-output pairs that represent the desired behavior of the model.
  • Training: Fine-tune the model using the prepared dataset. This involves adjusting the model's weights to minimize the error between the predicted and actual outputs.
  • Evaluation: Evaluate the fine-tuned model to ensure it performs well on the specific tasks. This may involve using validation datasets and performance metrics.

Advantages:

  • Improved Accuracy: Fine-tuning helps the model generate more accurate and contextually relevant responses for specific tasks.
  • Task-Specific Optimization: Fine-tuning allows you to optimize the model for specific tasks, such as controlling servos or interpreting sensor data in robotics.
  • Enhanced Performance: Fine-tuning can significantly improve the model's performance on domain-specific tasks compared to the pre-trained model.
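The data preparation step above can be sketched in Python as writing input-output pairs to JSON Lines. The chat-style record layout is an assumption, since the exact schema depends on the training tool you use, and the two example pairs are made up.

```python
import json

# Hypothetical input-output pairs describing the desired behavior.
pairs = [
    ("Move the servo on port D0 to 90 degrees.", "D0=90"),
    ("Center the servos on ports D1 and D2.", "D1=90,D2=90"),
]


def to_jsonl(examples):
    """Serialize (prompt, completion) pairs as one JSON record per line."""
    lines = []
    for prompt, completion in examples:
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


dataset = to_jsonl(pairs)
print(dataset)
```

Holding back a portion of the pairs as a validation set gives the evaluation step something the model has not seen.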

Combining Embedding and Fine-Tuning

Combining embedding and fine-tuning provides a powerful approach to enhancing a pre-trained model for specific domains and tasks. Here's how you can do it:

  1. Embed Domain Data: Start by embedding the robotics and ARC domain data into the pre-trained model. This provides the model with the necessary knowledge to understand and generate responses related to these domains.
  2. Fine-Tune on Embedded Data: Fine-tune the model using the embedded data to optimize its performance for specific tasks. This helps the model generate more accurate and contextually relevant responses.

Practical Steps for Embedding and Fine-Tuning

  1. Collect Data:
    • Gather a comprehensive dataset related to robotics and ARC, including research papers, technical documentation, tutorials, and user manuals.
  2. Preprocess Data:
    • Clean and preprocess the data to ensure it is in a suitable format for embedding and fine-tuning.
  3. Embed Data:
    • Use the preprocessed data to create embeddings and integrate them into the pre-trained model.
  4. Prepare Fine-Tuning Dataset:
    • Create a fine-tuning dataset using the embedded data, including input-output pairs that represent the desired behavior of the model.
  5. Fine-Tune Model:
    • Fine-tune the model using the prepared dataset to optimize its performance for specific tasks.
  6. Evaluate Model:
    • Evaluate the fine-tuned model using validation datasets and performance metrics to ensure it performs well on the specific tasks.

Conclusion

By embedding robotics and ARC domain data into an existing model like LLaMA-3 and then fine-tuning the model on this new embedded data, you can significantly enhance the model's capabilities for specific tasks. This approach provides a powerful and efficient way to leverage pre-trained models for domain-specific applications in robotics and ARC.

If you need further assistance or specific examples, feel free to ask!


I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
PRO
Synthiam
#19  

You’re over complicating things with fine tuning. Just because there’s several functions for working with llm’s doesn’t mean you have to use all of them.

you only need to embed

but - you also want to practice prompting. Prompting is the only time that micromanaging is beneficial. You need the prompt to be very specific and direct without leaving any assumptions. If you leave assumptions, expect the unexpected - which is based on your training data and the randomness of the creativity value.

I don’t know how to suggest shoehorning the open source stuff you’re working with to compete with OpenAI. There’s a reason one is free and the other employs the most talented experts. But I do love an underdog story - I just don’t bet my money on it:)

Because open ai has the highest amount of computing power on earth - you’ll have a difficult time finding compatible and flexible models.

My recommendation - and I know you don’t value open ai’s cost vs result - but my recommendation would be to practice with open ai for your inverse kinematics. Our results have been outstanding.

and it really doesn’t require programming. It requires strict instructions as I explained above- which is what humans love to do- give orders haha.

PRO
Synthiam
#20  

Oh, one other thing: newer GPTs use NLP to prioritize domain knowledge via internal fine-tuning-type processes. So that’s a big one if you’re asking to fine tune or embed for multiple tasks as you listed.

if you just wanted to do one task, then you could simplify the process without needing nlp.

Put it this way, when you want your lawn mowed and plumbing fixed, you call different professionals.

PRO
Canada
#21  

I am building a server with 6 Nvidia RTX 3090 GPUs with NVLinks, a ROMED8-2T server motherboard, and a 128-lane EPYC CPU with 256GB of RAM for the purpose of training models.

User-inserted image

User-inserted image

PRO
Canada
#22  

I am struggling with the riser cables for the last 2 cards at the moment, as I can only get them to run at PCIe 3.0 x16, but I just got some new PCIe 5.0 cables, so hopefully they can run at full PCIe 4.0 speed. The problem is that you lose about 1.5 dB for every inch of PCIe traces and you really only have about 8 dB of budget, so the cable runs need to be short. I should have gone with a motherboard that supported MCIO ports and used twinax cables to reduce loss. I'm just putting the new cables in now. It works great on 4 cards, but I should be able to do about 2 to 3 epochs a day on 6 GPU cards.
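As a rough check of those numbers, here is the arithmetic in Python. It takes the two figures from the post at face value and assumes loss grows linearly with length, ignoring connectors and other losses, so treat it as a back-of-the-envelope estimate only.

```python
LOSS_DB_PER_INCH = 1.5  # figure quoted in the post
BUDGET_DB = 8.0         # figure quoted in the post


def remaining_budget_db(length_inches):
    """Loss budget left after a run of the given length (linear model)."""
    return BUDGET_DB - LOSS_DB_PER_INCH * length_inches


max_inches = BUDGET_DB / LOSS_DB_PER_INCH
print(f"{max_inches:.1f} inches")  # 5.3 inches
print(remaining_budget_db(4.0))    # 2.0
```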

This also flies when running models, and you can run multiple models simultaneously. So vision, chat, and robot control can all run at the same time, or I can run very large models when needed.

PRO
Synthiam
#23  

that is quite the system!

PRO
Canada
#24   — Edited

Most of us are in this space to learn, keep our minds active, and not necessarily just build robots. Making robots is entertaining and provides a great way to showcase our work to others in a way that they will appreciate. While some may want to build commercial robots and will ultimately license ARC as the engine that drives their robots, others are just happy to use the software and pay a monthly subscription fee.

Using Open Source in robotics is challenging. There are no standards, no documentation, no support, and code changes on a daily basis without warning. This is one of the reasons people gravitate towards ARC because it just works, you get support, and you can focus on building robots instead of trying to integrate some open-source code to work with multiple other open-source packages.

The problem with closed-source cloud-hosted apps is that they can be expensive and invasive, and you can lose control of your project and IP. The costs add up as you become dependent on multiple vendors who can increase costs and change license terms at will. For example, we have all seen the horror stories of huge GPU, API, and cloud bills. And look at Adobe's latest license agreement: essentially, they want to own the rights to, and control, everyone's content.

ARC initially aligned with Microsoft and Windows, and the reasoning behind this is understandable. ARCx will now work with Linux, Windows, Mac, and presumably other OS environments in the future supporting both open and closed source operating systems. The plugins should also work the same way. For Speech Recognition (STT) with ARC, you can use Microsoft, Google, and IBM speech recognition programs, but you can't use open-source options like Whisper. This means that if you create a commercial robot later, you have to pay API fees to one of the IT giants. For Voice (TTS), you can use Azure, Microsoft, IBM, etc but again, no open-source option like Piper. For image creation, you work with DALL-E but not open-source image tools like SDXL.

ARCx has an amazing opportunity to be the glue that brings and holds all of these unique open-source tools together for robotics. When you added that one feature to the OpenAI Chatbot that allowed you to enter a URL for any LLM, suddenly ARC could work with any open-source model. When image recognition was added and some minor tweaks were made, suddenly ARC could work with any open-source vision model. ARCx plugins should work the same way: with the leading IT providers, but also with open-source tools. This way, we are free to choose whether we use hosted pay-per-use tools or local open-source tools for TTS, STT, image recognition, image creation, and simulation environments.

For ARCx users, it will still be easy to start with the closed-source tools you provide, but if we want to reduce our personal costs or make a commercial robot, Synthiam ARCx can make all of these complex open-source tools easy to use, install, and configure, increasing ARCx functionality and reducing total costs while supporting the open-source communities.

PRO
Synthiam
#25  

Don’t forget the robot skills are open source. I don’t think ppl look at GitHub or follow the link. If something isn’t open then I post it. Usually it’s a pain to maintain the GitHub repo when most of our users have no interest in programming - and I support that. I don’t know why someone should need to program to make a robot shrug I mean, you don’t need to program to do anything else with a computer.

PRO
Canada
#26  

Please don't get me wrong, the last thing I want is for ARC to be open source, and I do appreciate that the skills are open source. As I mentioned, there are lots of challenges with open source, including no support. If I had said this doesn't work with my open source models and ARC was free open source, you would probably have said go suck a lemon. Instead, you opened a new thread as a support ticket and said let's get this working. SUPER HAPPY CUSTOMER.

#27  

Quote:

most of our users have no interest in programming - and I support that. I don’t know why someone should need to program to make a robot shrug I mean, you don’t need to program to do anything else with a computer.

@DJ, I get what you say. When I found your platform EZ Builder years ago, I was thrilled that I would not have to learn how to write code to get my robot to work the way I wanted. I thought I could just build my robot with my hands and do some simple EZ Scripting to get it to move and talk a little. After I got into EZ Script, I realized how much it could do and how easy it was to understand. I wanted my robot to do more, but I found that I really did have to learn "a little more" code to get it to act the way I wanted. I did the extra work of learning more about EZ Script and was amazed how my robot came to life.

I'm very grateful that your creation, EZ Script and EZ Builder (now ARC), was out there when I needed it. It gave me the courage to do more and push my boundaries. I do know that your creation has tons of lines of scripting behind the EZ Commands and Skills in EZ Builder that made my simple lines of EZ Script work. Now that I'm learning and working with Javascript so I can move on to ARCx, I can really appreciate what must be going on behind the curtain of the old EZ Builder and now ARC.

As a footnote, after learning enough Javascript to get me going, I've converted all my EZ Scripts over in preparation for ARCx. I've moved on to the more powerful (Your words DJ) Javascript and am very happy I did. A lot of the robot performance bugs I've been putting up with over the years are now gone. Not because EZ Script was at fault but because I was able to streamline my scripts and learn how they really run and affect my robot.

PRO
Synthiam
#28  

Ah okay I see both sides of your perspectives. Dave, good news is the ability to code in ARCx isn’t removed. It’s assisted by ai. Meaning you can ask Athena to help write code in the editor.

Regarding what nink wants to do, such as the weed killer robot: it’ll be easier with ARCx than ARC, but you can do that today with ARC. You’d simply need to be good at prompt engineering.

For example, you would define in the prompt the robot’s abilities to move and identify. You would also need to provide an updated timestamp on every query so it knows when things happen and the time between them.
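A query built that way might look like this Python sketch. The ability names, the observation text, and the helper name `build_query` are all hypothetical; the only point is that every query carries the abilities and a fresh timestamp.

```python
from datetime import datetime, timezone


def build_query(abilities, observation, now=None):
    """Assemble one query that restates the abilities and stamps the time."""
    stamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    return "\n".join(
        [
            "You are a robot. You can only use these abilities: "
            + ", ".join(abilities) + ".",
            f"Current time: {stamp}",
            f"Observation: {observation}",
            "Reply with the single ability to perform next.",
        ]
    )


fixed_time = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
query = build_query(
    ["move forward", "turn left", "turn right", "spray"],  # hypothetical abilities
    "A weed is visible about one metre ahead.",
    now=fixed_time,
)
print(query)
```

Because the model keeps no clock of its own, the timestamp is what lets it reason about how long ago the previous observation happened.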

I don’t think ppl realize how powerful the gpt skill is today on its own. All you have to do is give it instructions in a prompt. And in response perform the actions it suggests. Rinse repeat:)