This Unity project provides a WebSocket-based client for interacting with a LLaVA server, a multimodal language model capable of processing text and images. Designed for use in VR/AR or mobile environments (e.g., Meta Quest), it enables real-time prompt submission and response handling, with optional image input and online TTS playback (Wit.ai).
- 🔌 Connects to a LLaVA server via WebSocket (`ws://...`)
- 🖼️ Sends text prompts and optional image bytes (JPG)
- 🔐 Includes SHA256 hashing for image verification
- 🔊 Optional TTS playback via `AndroidTTS` integration
- 🧠 Parses and displays JSON responses in Unity UI
- 🧪 Includes ping test for server connectivity
- `QuestLLaVAClient.cs`: Main MonoBehaviour script for WebSocket communication
- `LLaVAResponse`: Response structure for parsed server output
- `LlavaHeader`: Struct for request metadata (prompt, image length, temperature, etc.)
- `ServerReply`: Lightweight helper for simplified JSON parsing
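As a rough orientation, the two data types might be laid out as in the sketch below. The field names are assumptions inferred from the description above and from the JSON header shown in the server section; the actual structs in `QuestLLaVAClient.cs` may differ.

```csharp
using System;

// Illustrative only: field names follow the JSON header shown later in this README
// ("prompt", "image_len", "temperature", "max_tokens"); the real structs may differ.
[Serializable]
public struct LlavaHeader
{
    public string prompt;        // user prompt text
    public int image_len;        // byte length of the JPG that follows, 0 if none
    public float temperature;    // sampling temperature forwarded to the LLM
    public int max_tokens;       // response length limit
}

[Serializable]
public class LLaVAResponse
{
    public string text;          // parsed model output (assumed field name)
}
```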
- Capture or load a JPG image
- Construct a `LlavaHeader` with metadata
- Send header + image bytes over WebSocket
- Receive and parse the response
- Display result in UI and optionally speak it via TTS
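The steps above amount to a short WebSocket round trip. The sketch below illustrates it with .NET's `ClientWebSocket` and Unity's `JsonUtility`; it is not the project's actual code, and the header fields mirror the assumptions in the struct sketch above.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using UnityEngine;

public static class LlavaFlowSketch
{
    [Serializable]
    private struct Header   // same assumed fields as the LlavaHeader sketch above
    {
        public string prompt;
        public int image_len;
        public float temperature;
        public int max_tokens;
    }

    // Sketch of the send/receive round trip described above (not the project's exact code).
    public static async Task<string> AskAsync(Uri serverUri, string prompt, byte[] jpgBytes)
    {
        using var ws = new ClientWebSocket();
        await ws.ConnectAsync(serverUri, CancellationToken.None);

        // Build the header with prompt + image metadata and send it as a text frame.
        var header = new Header
        {
            prompt = prompt,
            image_len = jpgBytes != null ? jpgBytes.Length : 0,
            temperature = 0.3f,
            max_tokens = 256
        };
        byte[] headerJson = Encoding.UTF8.GetBytes(JsonUtility.ToJson(header));
        await ws.SendAsync(new ArraySegment<byte>(headerJson),
                           WebSocketMessageType.Text, true, CancellationToken.None);

        // Send the raw JPG bytes as a binary frame (skipped when there is no image).
        if (jpgBytes != null && jpgBytes.Length > 0)
            await ws.SendAsync(new ArraySegment<byte>(jpgBytes),
                               WebSocketMessageType.Binary, true, CancellationToken.None);

        // Receive the JSON reply; a real implementation would loop until EndOfMessage.
        var buffer = new byte[64 * 1024];
        WebSocketReceiveResult result =
            await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        return Encoding.UTF8.GetString(buffer, 0, result.Count);
    }
}
```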
- Attach `QuestLLaVAClient` to a Unity GameObject
- Assign `TMP_InputField` and `TextMeshProUGUI` for UI (optional)
- Set `serverUri` to your LLaVA server (e.g., `ws://192.168.2.29:19111/llava`)
- Call `SendPrompt()` or `SendPromptWithImageBytes()` to trigger interaction (see the sketch below)
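Once the component is configured as above, a call with an image might look like the following sketch. The `SendPromptWithImageBytes(string, byte[])` signature is an assumption based on the method name; check the actual script for the exact parameters.

```csharp
using UnityEngine;

public class LlavaUsageExample : MonoBehaviour
{
    public QuestLLaVAClient client;   // assigned in the Inspector
    public Texture2D snapshot;        // any readable texture, e.g. a passthrough capture

    public void Describe()
    {
        // Text-only prompt.
        client.SendPrompt("What do you see?");

        // Prompt plus a JPG-encoded image (assumed signature: prompt, image bytes).
        byte[] jpg = snapshot.EncodeToJPG();
        client.SendPromptWithImageBytes("Describe this image", jpg);
    }
}
```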
Use `PingServer()` to verify connectivity with your LLaVA server. This sends an HTTP GET to `/ping` and logs the result.
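If you want to reproduce that check by hand, the equivalent is a plain HTTP GET against the `/ping` endpoint. A minimal sketch with `UnityWebRequest` is shown below; the host and port are placeholders for your own server address.

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.Networking;

public class PingSketch : MonoBehaviour
{
    // Placeholder address; use your server's host/port with the /ping path.
    public string pingUrl = "http://192.168.2.29:19111/ping";

    public IEnumerator Ping()
    {
        using (var request = UnityWebRequest.Get(pingUrl))
        {
            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                Debug.Log("Ping OK: " + request.downloadHandler.text);
            else
                Debug.LogWarning("Ping failed: " + request.error);
        }
    }
}
```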
If `AndroidTTS` is assigned in the Inspector, responses will be spoken aloud. Toggle `showRawJsonInUI` to control whether full JSON or parsed text is shown.
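For example, with a `QuestLLaVAClient` reference named `client`, a text-only request is a single call: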
```csharp
client.SendPrompt("Describe this image");
```

## 📦 Requirements

- Unity 2021+
- TextMeshPro
- AndroidTTS (optional)
- LLaVA server running with WebSocket support

## 📜 License

MIT. Feel free to modify and extend for your own projects.

## 🙌 Credits

Inspired by the LLaVA project and designed for real-time multimodal interaction in Unity environments.
This project implements a lightweight WebSocket server in C# that accepts multimodal prompts (text + optional image), forwards them to a local LLM (e.g., llama.cpp or LLaVA), and optionally returns synthesized speech using KokoroSharp — a fast, local text-to-speech engine.
- 🌐 WebSocket endpoint `/llava` for real-time prompt exchange
- 🖼️ Supports image + text input via binary WebSocket frames
- 🔁 Forwards prompts to a local LLM server (`/v1/chat/completions`)
- 🔊 Optional TTS synthesis via KokoroSharp or macOS `say`
- 📦 HTTP endpoint `/tts` for direct text-to-speech conversion
- 🔔 `/ping` and `/beep` endpoints for diagnostics and testing
- 🧠 Uses reflection to dynamically load KokoroSharp without hard dependencies
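The reflection-based loading mentioned in the last point keeps KokoroSharp out of the compile-time dependencies. The pattern looks roughly like the sketch below; the DLL path and the type/method names are placeholders, not KokoroSharp's actual API.

```csharp
using System;
using System.Reflection;

static class OptionalTtsLoader
{
    // Attempts to load a TTS engine at runtime; returns null when the DLL is absent.
    // "KokoroSharp.dll" and the type/method names below are illustrative placeholders.
    public static object TryCreateEngine()
    {
        try
        {
            Assembly asm = Assembly.LoadFrom("KokoroSharp.dll");
            Type engineType = asm.GetType("KokoroSharp.KokoroTTS");
            return engineType != null ? Activator.CreateInstance(engineType) : null;
        }
        catch (Exception)
        {
            return null; // No hard dependency: the server keeps running without TTS.
        }
    }

    public static void TrySpeak(object engine, string text)
    {
        // Invoke a speak-like method by name if one exists (placeholder method name).
        MethodInfo speak = engine?.GetType().GetMethod("Speak", new[] { typeof(string) });
        speak?.Invoke(engine, new object[] { text });
    }
}
```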
- Client connects and sends a JSON header:
{ "prompt": "Describe this image", "image_len": 123456, "temperature": 0.3, "max_tokens": 256 }
- Fix turn/click prompting using the Meta Quest joystick
- Fix on-device TTS
- Include speech commands (on-device ASR)
Portions of this project’s camera overlay logic were adapted from Meta’s PassthroughCameraApiSamples, specifically the CameraToWorldManager component included in the Meta XR PCSA Mixed Reality Starter Samples. Original samples © Meta Platforms, Inc. and affiliates. Used under the terms of the Meta SDK license. All other code, including the QuestLLaVAClient and multimodal integration, is original and authored by Gildas.