
Self-Contained Offline Robotics Systems

Are there any robotic systems designed to be completely self-contained and operate offline? I am interested in building an animatronic robot capable of engaging in interactive and realistic conversations using strictly Edge AI technology. My goal is to avoid any ongoing API fees associated with cloud-based services. Has anyone successfully implemented a system like this?


Related Hardware: EZ-B v4
Related Controls: OpenAI ChatGPT, Autonomous AI


#1  

You have a few options. You could use a basic AI for conversation that runs locally, like AimlBot. Or you can build a PC that's capable of LLM inference and point the Autonomous AI or OpenAI ChatGPT robot skills to the local AI.

There are several local LLM AIs you can use.

@athena will explain more.

#2  

Short answer: Yes - fully offline, self-contained robots are possible. You can combine a local LLM (or AIML engine), local speech-to-text and text-to-speech, and your EZ-B v4 + ARC project to create an animatronic that never leaves your LAN and has no ongoing cloud fees. Below I’ll outline practical options, hardware needs, recommended local software stacks, how to point Autonomous AI (or other ARC skills) at a local model, and next steps I can help with.

What you can use today (three common approaches)

  • AIML (AimlBot) - fully local, lightweight, deterministic conversation. Good if you want predictable, rule-based dialog and very low hardware requirements. (You already have this.)
  • Small/quantized LLMs via llama.cpp / ggml / text-generation-webui - run on a desktop or an embedded board with modest GPU/CPU. Best for cheaper/local inference (7B or smaller models, or quantized 13B).
  • Larger LLMs via LocalAI / Ollama / commercial containerized runtimes - run on a local Linux/Windows PC with a GPU (or multi-GPU) to get better quality (13B-70B with quantization/GPTQ). No cloud API calls; you host the model.

Key components for a fully offline robot

  1. Local LLM server / inference engine (text only)

    • Examples: LocalAI (OpenAI-compatible REST API), text-generation-webui (with ggml/quantized models), ollama, llama.cpp (ggml), GPTQ quantized runtime.
    • Runs on the same PC as ARC or on a LAN machine. Provides an HTTP API your Autonomous AI skill can call.
  2. Offline Speech-to-Text (STT) and Wake/Voice Activity Detection

    • whisper.cpp, VOSK, Silero STT, Coqui STT - run locally for speech recognition.
    • whisper.cpp is popular and robust. Combined with a VAD (voice activity detection), it can handle continuous listening offline.
  3. Offline Text-to-Speech (TTS)

    • Coqui TTS, Mozilla TTS, Mimic3, or small neural TTS models that run locally. You can route TTS audio to PC speakers or EZB speaker (if supported).
    • You can also use ARC’s Audio.say() for PC TTS if the OS has offline voices.
  4. ARC / ARC Robot Skills

    • Keep AimlBot as fallback/fast responses.
    • Use Autonomous AI and set its API/Base Domain to your local LLM server (Settings -> kebab menu -> Settings tab -> API Settings).
    • Use the Response Variable ($AutonomousAIResponse) and the Response Script to trigger TTS, move servos, or call other skills.

Hardware guidance

  • Minimum for small/cheap setup: a decent multicore CPU, 16+ GB RAM. Can run small models (ggml 3-7B) on CPU with optimizations (slow but workable).
  • Recommended for practical LLM quality: a PC with an NVIDIA GPU (8-24+ GB VRAM). Examples:
    • 8-12 GB VRAM (RTX 3060) - comfortably run quantized 7B; limited 13B quantized.
    • 16-24 GB VRAM (RTX 4080/3090/4090) - run 13B-30B quantized with better latency.
    • 48+ GB VRAM (workstation cards such as the RTX A6000, or multiple GPUs) - run larger models with best quality.
  • Jetson / embedded boards can run small quantized models but are limited.

Model choices & licensing

  • Open weights (Llama 2, Mistral, Vicuna, etc.) - check the license. Llama 2's Meta license allows offline use, but read the terms.
  • Use quantized (GPTQ/ggml) versions to fit models on lower-VRAM GPUs or CPU.
  • For best offline quality with limited budget, pick a well-quantized 7B-13B "instruct" style model.

How to host a local model (high level)

  • Option A (easiest for OpenAI-style integration): LocalAI
    1. Install LocalAI (Docker or binary). It exposes an OpenAI-compatible REST API (e.g., http://localhost:8080).
    2. Download a ggml / quantized model and put it where LocalAI expects.
    3. Start LocalAI and test with curl/Postman (or from an ARC script - see the sketch after this list).
  • Option B: text-generation-webui / llama.cpp + webui
    1. Install text-generation-webui and a ggml model.
    2. Start the web UI and use its API endpoints.
  • Option C: Ollama (convenient but check T&C) - runs models locally and provides an API.
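
If you'd rather test from inside ARC than from curl/Postman, a short script can confirm the server is reachable. This is only a sketch: it assumes ARC's JavaScript exposes a Net.hTTPGet() helper (check the JavaScript manual in ARC for the exact name) and reuses the LocalAI example address from Option A.

// Reachability check for a local OpenAI-compatible server (sketch).
// Assumes a Net.hTTPGet() helper exists in ARC's JavaScript API - verify in ARC's manual.
// http://localhost:8080 is the LocalAI example from Option A; /v1/models lists loaded models.
var models = Net.hTTPGet("http://localhost:8080/v1/models");
print("Local LLM server replied: " + models);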

Connecting Autonomous AI / ARC to a local LLM

  1. Start your local model server and confirm its API (example LocalAI: http://localhost:8080/v1/completions or OpenAI-compatible path).
  2. In ARC, open the Autonomous AI robot skill settings:
    • Click the kebab menu (three lines) in the Autonomous AI skill window -> Settings (or the Settings tab).
    • In API Settings, set "API Key" if your local server requires one (optional). Set "Base Domain" to your local server URL (for example, http://localhost:8080).
    • Choose the model name configured on your local server.
    • Save.
  3. Test using the skill GUI "Send" button or from a script:

JavaScript example (send a request, read response)

// Send text to Autonomous AI
controlCommand("Autonomous AI", "Send", "Hello robot, introduce yourself.");

// Wait a little then print the response variable
sleep(1500); // milliseconds
print("Autonomous AI responded:", getVar("$AutonomousAIResponse", "no response"));

Fallback and hybrid strategies

  • Use AimlBot for deterministic, persona-driven replies and local triggers (e.g., safety stop phrases).
  • Use Autonomous AI + local LLM for open-ended reasoning and plan generation.
  • Use the Autonomous AI Response Script (kebab menu -> Scripts -> Conversation Response Script) to map LLM output to robot actions (speak, move servos, call ControlCommand() on other skills).
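
To make the hybrid idea concrete, here is a rough routing sketch. The $SpeechText variable name is a placeholder for whatever your STT skill actually sets, and Movement.stop() assumes ARC's Movement namespace is available in your project - adapt both to what you have; only the "Send" ControlCommand comes from the example above.

// Hybrid routing sketch: handle safety phrases locally, send open-ended text to the local LLM.
// $SpeechText is a placeholder for the variable your STT skill fills with recognized text.
var heard = getVar("$SpeechText", "");

if (heard.toLowerCase().indexOf("stop") >= 0) {
  // Deterministic safety handling stays local - no LLM round trip.
  Movement.stop();
  Audio.say("Stopping.");
} else if (heard.length > 0) {
  // Everything else goes to the local LLM through the Autonomous AI skill.
  controlCommand("Autonomous AI", "Send", heard);
}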

Speech & audio integration (offline)

  • STT: whisper.cpp (fast; runs on GPU or CPU), VOSK (small models), Silero - feed recognized text into Autonomous AI or AimlBot.
  • TTS: Coqui TTS or Mimic3 - produce audio files to play with Audio.playAudioFile(filename) or Audio.SayEZB() to speak from EZB if configured.
  • In Autonomous AI settings you can set the Response Script to:
    • Audio.say(getVar("$AutonomousAIResponse")) // PC audio
    • controlCommand("AzureSpeech", "SpeakWait", getVar("$AutonomousAIResponse")) // or whichever TTS robot skill you have configured (use an offline one to stay cloud-free)

Safety and control

  • Use Autonomous AI’s safety settings: set Max Recursive Calls, use "Pause/Stop" commands, and have a physical/emergency stop accessible.
  • Limit capabilities if you’re concerned about runaway behavior (e.g., don’t expose direct servo-control commands until tested with small steps).
  • Use getVar("$IsSpeaking") and getVar("$EZBPlayingAudio") to coordinate actions and avoid overlap.
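
A minimal sketch of that coordination, polling the $IsSpeaking and $EZBPlayingAudio variables mentioned above before speaking (the 10-second timeout and 250 ms poll interval are arbitrary; adjust the comparisons if your project stores these flags differently):

// Wait until neither PC TTS nor EZ-B audio is active before acting (sketch).
// These variables are assumed to be boolean flags - adjust if your project differs.
var waited = 0;
while ((getVar("$IsSpeaking", false) == true || getVar("$EZBPlayingAudio", false) == true) && waited < 10000) {
  sleep(250);
  waited = waited + 250;
}

// Safe to speak (or move) without overlapping audio.
Audio.say(getVar("$AutonomousAIResponse", ""));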

Practical example build paths

  • Hobby / budget (no GPU): AimlBot + whisper.cpp (CPU) + Coqui TTS. Deterministic, offline, low latency but limited language complexity.
  • Entry-level LLM (local PC, modest GPU): LocalAI + 7B quantized model + whisper.cpp + Coqui TTS. Good conversational quality, offline.
  • High quality offline (multi-GPU or large VRAM): LocalAI / Ollama + 13B-34B quantized models. Near-cloud quality, but heavier hardware and disks.

Next steps I can help with

  • Recommend specific model and quantization settings for your available hardware (tell me your PC GPU, RAM, and disk space).
  • Give a step-by-step LocalAI (or text-generation-webui) install and quantized model setup guide, then show exactly how to configure Autonomous AI in ARC (kebab menu steps).
  • Provide example Autonomous AI Response Script that safely converts responses into TTS and movement control (with safety checks).
  • Help wire up offline STT (whisper.cpp) to send recognized text into AimlBot or Autonomous AI using ControlCommand().

Questions for you

  1. What is your available compute (PC CPU, GPU model and VRAM, RAM, OS)?
  2. Do you want the best conversational quality possible (larger models/stronger hardware), or is avoiding cost and simplicity your priority?
  3. Do you want a fully autonomous planner (Autonomous AI controlling servos), or only conversational interaction with the robot moving via segregated commands?

If you tell me your hardware and your priority (quality vs cost), I’ll give a tailored, step-by-step offline setup you can run with ARC and your EZ-B v4.


I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.
#3  

I am using a Raspberry Pi 5 with 8 GB RAM, a Hailo-8 AI accelerator, and a 128 GB SSD. I want to run INT4 models with Ollama on it to improve offline chat performance.

#4  

Great - your hardware (Raspberry Pi 5, 8 GB RAM, 128 GB SSD, Hailo-8) is a solid starting point for an offline solution, but there are important compatibility and performance considerations to plan for. Below I’ll explain what’s viable, what isn’t straightforward, and give concrete, realistic options (including step-by-step actions) so you can pick the best path for running INT4 models locally and connecting them to ARC / Autonomous AI.

Summary / quick answer

  • Ollama can run locally on ARM64 in some setups, but Ollama does NOT natively target the Hailo-8 accelerator. You’ll likely run Ollama on the Pi CPU (slow for large models) or on a separate x86/ARM machine with more power.
  • Hailo-8 is a specialized accelerator. Running general transformer LLM workloads on Hailo-8 is not a plug-and-play option - it requires Hailo toolchains, model compilation, and support for transformer ops. At present, mainstream LLM runtimes (Ollama, llama.cpp, LocalAI) don’t automatically offload to Hailo-8.
  • Best practical approach on a Pi 5: use highly quantized GGUF/ggml models (Q4/Q4_0/Q4_K variants, or INT4 variants if available) with an ARM-optimized LLM runtime (llama.cpp / text-generation-webui). That will give you usable offline chat with reasonable latency for 7B-class models (13B will be tight in 8 GB of RAM).
  • If you insist on Ollama’s workflow and INT4: you can try Ollama on the Pi (ARM64) with INT4 models, but performance will be limited compared with a GPU system. Hailo-8 acceleration is not a drop-in option.

Options in detail and recommended path

Option A - Recommended: llama.cpp / ggml / text-generation-webui (on Pi 5, good balance)

  • Why: llama.cpp and ggml have strong ARM support, many quantization formats (Q4, Q2/INT4 variants via quant tools), and are widely used on low-power ARM devices. Builds can use NEON optimizations for the Pi CPU. This gives you offline inference with a relatively low memory footprint.
  • Best models: Llama 2 7B or 13B quantized to Q4/Q4_K_M, or INT4 variants if you have a quantized GGUF/ggml file. 7B Q4 is very workable on 8 GB RAM.
  • Typical stack:
    • Build llama.cpp with ARM/NEON support.
    • Use a quantized model (GGUF or .bin) - either download a pre-quantized release or convert with tools (AutoGPTQ, llama.cpp’s quantize).
    • Use text-generation-webui if you want a local web UI and API.
  • Pros: Works well on Pi 5, proven community support, no cloud fees, good speed for small models.
  • Cons: Still slower than GPU; output quality lower than very large models.

Option B - Ollama on Pi (possible, but limited)

  • Why you might choose Ollama: convenient model management and an OpenAI-compatible API.
  • Reality: Ollama offers ARM builds in many cases, and supports running quantized models. However:
    • Ollama will run on the Pi CPU (ARM64) but performance is limited for larger models.
    • Ollama does not automatically make use of Hailo-8 - no automatic offload.
  • If you want to proceed with Ollama on Pi:
    • Check Ollama ARM support and install the ARM package for Linux (depending on Ollama release). See: https://ollama.com (confirm ARM/Ubuntu package).
    • Use an INT4 quantized model that Ollama recognizes (or a GGUF quantized model that Ollama supports).
    • Expect low throughput; smaller models (7B quantized) are the realistic target.

Option C - Offload model hosting to a more powerful box / keep Pi as controller

  • Best for higher quality: run Ollama or LocalAI on a PC/GPU in your LAN and point Autonomous AI to that host (ARC’s Autonomous AI Base Domain pointed to http://yourserver:port). This keeps your animatronic offline (no cloud) but gives better model performance.
  • Use Pi + Hailo for specialized vision tasks (if Hailo supports those pipelines) while a local GPU box runs LLM inference.

Hailo-8 notes (important)

  • Hailo-8 is a specialized NPU; it doesn’t run arbitrary LLM runtimes out of the box.
  • To use Hailo-8 for transformer inference you need:
    • Hailo’s SDK + toolchain,
    • A model converted/compiled to the Hailo runtime, and
    • Confirmed support for the necessary transformer kernels/attention patterns.
  • Currently, community LLM runtimes do not have off-the-shelf Hailo backends. If Hailo has published an LLM flow/SDK for transformer models, you’d need to follow Hailo’s documentation and likely custom-compile or adapt the model - that’s an advanced, engineering-heavy route.

Concrete step-by-step path I recommend (llama.cpp + text-generation-webui on Raspberry Pi 5)

  1. OS & prerequisites
    • Use Ubuntu Server or Raspberry Pi OS 64-bit. (Ubuntu 24.04 LTS ARM64 is recommended for easier package availability.)
    • Update system:
      sudo apt update && sudo apt upgrade -y
      sudo apt install -y build-essential git cmake libopenblas-dev libblas-dev liblapack-dev python3 python3-pip
      
  2. Build llama.cpp (ARM optimized)
    • Clone and build:
      git clone https://github.com/ggerganov/llama.cpp.git
      cd llama.cpp
      make -j$(nproc)
      
    • On ARM, make will produce NEON-optimized binaries by default where supported. (If there are special ARM flags required, llama.cpp docs will note them - check the repo.)
  3. Acquire / quantize a model
    • Ideally, get a pre-quantized GGUF or ggml model (e.g., quantized Llama 2 7B Q4).
    • If you only have FP16 weights, use community tools to quantize to Q4/Q4_0 or INT4 (AutoGPTQ or llama.cpp’s quantize). This usually requires a separate, larger x86 machine for conversion.
    • Example run command for a ggml model:
      ./main -m /path/to/model.gguf -t 4 -c 2048
      
      • -t sets the thread count, -c the context size.
  4. (Optional) Use text-generation-webui for a local REST API
  5. Test locally and measure latency. If response times are acceptable, expose the web UI / API on the Pi and point ARC -> Autonomous AI to it.
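
For step 5, one way to measure end-to-end latency from the robot's side is to time a round trip through the Autonomous AI skill once it points at the Pi. A rough sketch (it assumes clearing $AutonomousAIResponse with setVar() beforehand is acceptable in your project; the 60-second timeout and 250 ms poll are arbitrary):

// Rough round-trip latency check through the Autonomous AI skill (sketch).
// Assumes it is OK to clear $AutonomousAIResponse before sending a new prompt.
setVar("$AutonomousAIResponse", "");
var start = new Date().getTime();

controlCommand("Autonomous AI", "Send", "Say hello in one short sentence.");

// Poll until the response variable is populated, or give up after 60 seconds.
while (getVar("$AutonomousAIResponse", "") == "" && new Date().getTime() - start < 60000)
  sleep(250);

print("Response after " + (new Date().getTime() - start) + " ms: " + getVar("$AutonomousAIResponse", "(timed out)"));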

If you want to proceed with Ollama specifically on Pi

  • Check Ollama’s ARM Linux package and installation instructions at https://ollama.com/docs. If they provide an ARM build for Linux:
    • Install Ollama on Pi.
    • Pull an INT4 quantized model that Ollama supports.
    • Run ollama serve to start the Ollama daemon, then confirm the API responds (Ollama exposes an OpenAI-compatible endpoint under /v1).
    • In ARC Autonomous AI: set Base Domain to http://raspberrypi:PORT (see ARC instructions below).

How to connect Autonomous AI in ARC to your Pi LLM server

  1. Start your local server (text-generation-webui, LocalAI, or Ollama) and confirm it answers a simple HTTP request on the LAN.
  2. In ARC: open the Autonomous AI skill settings (kebab menu -> Settings tab -> API Settings), set the Base Domain to your Pi’s server URL (for example, http://raspberrypi:PORT), choose the model name configured on your server, and save.
  3. Test from ARC UI Send button or via script:
    • Example JavaScript script for ARC:
      // Send a prompt to Autonomous AI
      controlCommand("Autonomous AI", "Send", "Hello robot, introduce yourself.");
      // Wait a bit and then print the response variable
      sleep(2000);
      print("AI response:", getVar("$AutonomousAIResponse", "no response"));
      
  4. In the Autonomous AI Conversation Response Script (kebab menu -> Scripts -> Conversation Response Script) create the logic to:
    • run TTS (Audio.say or your TTS robot skill),
    • call servo actions (Movement.* or ControlCommand to other skills),
    • enforce safety checks (stop if $IsSpeaking or safety var).
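
As a starting point, here is a hedged sketch of such a Conversation Response Script. It uses only calls and variables mentioned earlier in this thread; the "Auto Position" / "AutoPositionAction" / "Wave" names are placeholders - replace them with the ControlCommand your own servo/animation skill exposes, and test with small movements first.

// Conversation Response Script sketch: speak the reply and optionally gesture, with a basic guard.
var reply = getVar("$AutonomousAIResponse", "");

if (reply.length > 0 && getVar("$IsSpeaking", false) != true) {

  // 1) Text-to-speech (swap in your offline TTS robot skill if you are not using PC voices).
  Audio.say(reply);

  // 2) Very simple keyword-to-gesture mapping.
  //    The skill name and action below are placeholders - use the ControlCommand
  //    your servo/animation skill actually provides.
  if (reply.toLowerCase().indexOf("hello") >= 0)
    controlCommand("Auto Position", "AutoPositionAction", "Wave");
}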

Other practical tips

  • Start small: test with 7B Q4 models first. If latency is acceptable, try 13B quantized.
  • If conversion to INT4 is necessary, use an x86 machine for quantization (auto-gptq), then copy the quantized model to Pi.
  • Use offloading/hybrid approach if response quality/latency becomes an issue: run models on a local GPU machine and keep Pi as sensor/controller.
  • Use offline STT (whisper.cpp) and offline TTS (Coqui TTS) so whole system remains local.

Questions for you (so I can give a precise step-by-step guide)

  1. Which OS are you running on the Pi 5 (Raspberry Pi OS 64-bit, Ubuntu 24.04, etc.)?
  2. Do you require using Ollama specifically, or would you accept llama.cpp / text-generation-webui (often faster on the Pi)?
  3. Do you already have any quantized model files (7B/13B) or will you need help obtaining/quantizing them?
  4. Are you comfortable running Docker on the Pi, or do you prefer native installs?
  5. Do you want me to produce the exact ARC Autonomous AI settings + a sample Conversation Response Script to perform TTS + servo movement safely?

If you tell me your OS and whether you prefer Ollama or llama.cpp, I’ll produce an exact step-by-step install + ARC configuration guide tailored to your Pi 5 and your goal of INT4 inference.


I am a customer support GPT-powered AI and still learning. If you have doubts about my response, please use the support documentation to verify my answer.