This Unity project provides a WebSocket-based client for interacting with a LLaVA server, a multimodal language model capable of processing text and images. Designed for use in VR/AR or mobile environments (e.g., Meta Quest), it enables real-time prompt submission and response handling, with optional image input and online TTS playback (Wit.ai).
- 🔌 Connects to a LLaVA server via WebSocket (`ws://...`)
- 🖼️ Sends text prompts and optional image bytes (JPG)
- 🔐 Includes SHA256 hashing for image verification
- 🔊 Optional TTS playback via `AndroidTTS` integration
- 🧠 Parses and displays JSON responses in Unity UI
- 🧪 Includes ping test for server connectivity
- `QuestLLaVAClient.cs`: Main MonoBehaviour script for WebSocket communication
- `LLaVAResponse`: Response structure for parsed server output
- `LlavaHeader`: Struct for request metadata (prompt, image length, temperature, etc.)
- `ServerReply`: Lightweight helper for simplified JSON parsing
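As a rough orientation, the two data types might be laid out as in the sketch below. The field names are assumptions inferred from the description above and from the JSON header shown in the server section; the actual structs in `QuestLLaVAClient.cs` may differ.

```csharp
using System;

// Illustrative only: field names follow the JSON header shown later in this README
// ("prompt", "image_len", "temperature", "max_tokens"); the real structs may differ.
[Serializable]
public struct LlavaHeader
{
    public string prompt;        // user prompt text
    public int image_len;        // byte length of the JPG that follows, 0 if none
    public float temperature;    // sampling temperature forwarded to the LLM
    public int max_tokens;       // response length limit
}

[Serializable]
public class LLaVAResponse
{
    public string text;          // parsed model output (assumed field name)
}
```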
- Capture or load a JPG image
- Construct a `LlavaHeader` with metadata
- Send header + image bytes over WebSocket
- Receive and parse the response
- Display result in UI and optionally speak it via TTS
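The steps above amount to a short WebSocket round trip. The sketch below illustrates it with .NET's `ClientWebSocket` and Unity's `JsonUtility`; it is not the project's actual code, and the header fields mirror the assumptions in the struct sketch above.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using UnityEngine;

public static class LlavaFlowSketch
{
    [Serializable]
    private struct Header   // same assumed fields as the LlavaHeader sketch above
    {
        public string prompt;
        public int image_len;
        public float temperature;
        public int max_tokens;
    }

    // Sketch of the send/receive round trip described above (not the project's exact code).
    public static async Task<string> AskAsync(Uri serverUri, string prompt, byte[] jpgBytes)
    {
        using var ws = new ClientWebSocket();
        await ws.ConnectAsync(serverUri, CancellationToken.None);

        // Build the header with prompt + image metadata and send it as a text frame.
        var header = new Header
        {
            prompt = prompt,
            image_len = jpgBytes != null ? jpgBytes.Length : 0,
            temperature = 0.3f,
            max_tokens = 256
        };
        byte[] headerJson = Encoding.UTF8.GetBytes(JsonUtility.ToJson(header));
        await ws.SendAsync(new ArraySegment<byte>(headerJson),
                           WebSocketMessageType.Text, true, CancellationToken.None);

        // Send the raw JPG bytes as a binary frame (skipped when there is no image).
        if (jpgBytes != null && jpgBytes.Length > 0)
            await ws.SendAsync(new ArraySegment<byte>(jpgBytes),
                               WebSocketMessageType.Binary, true, CancellationToken.None);

        // Receive the JSON reply; a real implementation would loop until EndOfMessage.
        var buffer = new byte[64 * 1024];
        WebSocketReceiveResult result =
            await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        return Encoding.UTF8.GetString(buffer, 0, result.Count);
    }
}
```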
- Attach `QuestLLaVAClient` to a Unity GameObject
- Assign `TMP_InputField` and `TextMeshProUGUI` for UI (optional)
- Set `serverUri` to your LLaVA server (e.g., `ws://192.168.2.29:19111/llava`)
- Call `SendPrompt()` or `SendPromptWithImageBytes()` to trigger interaction (see the sketch below)
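Once the component is configured as above, a call with an image might look like the following sketch. The `SendPromptWithImageBytes(string, byte[])` signature is an assumption based on the method name; check the actual script for the exact parameters.

```csharp
using UnityEngine;

public class LlavaUsageExample : MonoBehaviour
{
    public QuestLLaVAClient client;   // assigned in the Inspector
    public Texture2D snapshot;        // any readable texture, e.g. a passthrough capture

    public void Describe()
    {
        // Text-only prompt.
        client.SendPrompt("What do you see?");

        // Prompt plus a JPG-encoded image (assumed signature: prompt, image bytes).
        byte[] jpg = snapshot.EncodeToJPG();
        client.SendPromptWithImageBytes("Describe this image", jpg);
    }
}
```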
Use `PingServer()` to verify connectivity with your LLaVA server. This sends an HTTP GET to `/ping` and logs the result.
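If you want to reproduce that check by hand, the equivalent is a plain HTTP GET against the `/ping` endpoint. A minimal sketch with `UnityWebRequest` is shown below; the host and port are placeholders for your own server address.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class PingSketch : MonoBehaviour
{
    // Placeholder address; use your server's host/port with the /ping path.
    public string pingUrl = "http://192.168.2.29:19111/ping";

    public IEnumerator Ping()
    {
        using (var request = UnityWebRequest.Get(pingUrl))
        {
            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                Debug.Log("Ping OK: " + request.downloadHandler.text);
            else
                Debug.LogWarning("Ping failed: " + request.error);
        }
    }
}
```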
If `AndroidTTS` is assigned in the Inspector, responses will be spoken aloud. Toggle `showRawJsonInUI` to control whether full JSON or parsed text is shown.
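For example, with a `QuestLLaVAClient` reference named `client`, a text-only request is a single call: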
```csharp
client.SendPrompt("Describe this image");
```

## 📦 Requirements

- Unity 2021+
- TextMeshPro
- AndroidTTS (optional)
- LLaVA server running with WebSocket support

## 📜 License

MIT. Feel free to modify and extend for your own projects.

## 🙌 Credits

Inspired by the LLaVA project and designed for real-time multimodal interaction in Unity environments.
This project implements a lightweight WebSocket server in C# that accepts multimodal prompts (text + optional image), forwards them to a local LLM (e.g., llama.cpp or LLaVA), and optionally returns synthesized speech using KokoroSharp — a fast, local text-to-speech engine.
- 🌐 WebSocket endpoint `/llava` for real-time prompt exchange
- 🖼️ Supports image + text input via binary WebSocket frames
- 🔁 Forwards prompts to a local LLM server (`/v1/chat/completions`)
- 🔊 Optional TTS synthesis via KokoroSharp or macOS `say`
- 📦 HTTP endpoint `/tts` for direct text-to-speech conversion
- 🔔 `/ping` and `/beep` endpoints for diagnostics and testing
- 🧠 Uses reflection to dynamically load KokoroSharp without hard dependencies
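The reflection-based loading mentioned in the last point keeps KokoroSharp out of the compile-time dependencies. The pattern looks roughly like the sketch below; the DLL path and the type/method names are placeholders, not KokoroSharp's actual API.

```csharp
using System;
using System.Reflection;

static class OptionalTtsLoader
{
    // Attempts to load a TTS engine at runtime; returns null when the DLL is absent.
    // "KokoroSharp.dll" and the type/method names below are illustrative placeholders.
    public static object TryCreateEngine()
    {
        try
        {
            Assembly asm = Assembly.LoadFrom("KokoroSharp.dll");
            Type engineType = asm.GetType("KokoroSharp.KokoroTTS");
            return engineType != null ? Activator.CreateInstance(engineType) : null;
        }
        catch (Exception)
        {
            return null; // No hard dependency: the server keeps running without TTS.
        }
    }

    public static void TrySpeak(object engine, string text)
    {
        // Invoke a speak-like method by name if one exists (placeholder method name).
        MethodInfo speak = engine?.GetType().GetMethod("Speak", new[] { typeof(string) });
        speak?.Invoke(engine, new object[] { text });
    }
}
```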
- Client connects and sends a JSON header:
{ "prompt": "Describe this image", "image_len": 123456, "temperature": 0.3, "max_tokens": 256 }
- Fix turn/click prompting using the Meta Quest joystick
- Fix on-device TTS
- Include speech commands (on-device ASR)
Portions of this project’s camera overlay logic were adapted from Meta’s PassthroughCameraApiSamples, specifically the CameraToWorldManager component included in the Meta XR PCSA Mixed Reality Starter Samples. Original samples © Meta Platforms, Inc. and affiliates. Used under the terms of the Meta SDK license. All other code, including the QuestLLaVAClient and multimodal integration, is original and authored by Gildas.