Using the Large Language and Vision Assistant (LLaVA) for scene understanding on Meta Quest 3 (VR)

Scene narration with Meta Quest 3

🧠 QuestLLaVAClient: Unity WebSocket Client for LLaVA

This Unity project provides a WebSocket-based client for interacting with a LLaVA server, a multimodal language model capable of processing text and images. Designed for VR/AR and mobile environments (e.g., Meta Quest), it enables real-time prompt submission and response handling, with optional image input and online TTS playback via Wit.ai.


🚀 Features

  • 🔌 Connects to a LLaVA server via WebSocket (ws://...)
  • 🖼️ Sends text prompts and optional image bytes (JPG)
  • 🔐 Includes SHA256 hashing for image verification
  • 🔊 Optional TTS playback via AndroidTTS integration
  • 🧠 Parses and displays JSON responses in Unity UI
  • 🧪 Includes ping test for server connectivity

🧰 Components

  • QuestLLaVAClient.cs: Main MonoBehaviour script for WebSocket communication
  • LLaVAResponse: Response structure for parsed server output
  • LlavaHeader: Struct for request metadata (prompt, image length, temperature, etc.)
  • ServerReply: Lightweight helper for simplified JSON parsing (a rough sketch of these types follows this list)
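
A rough sketch of what these serializable types could look like, inferred from the header fields shown in the server section (prompt, image_len, temperature, max_tokens). The SHA-256 field name and the response/reply field names are assumptions, not the project's actual definitions.

using System;

[Serializable]
public struct LlavaHeader
{
    // Request metadata sent ahead of the image bytes.
    public string prompt;      // e.g. "Describe this image"
    public int image_len;      // byte length of the JPG that follows (0 = text-only prompt)
    public float temperature;  // sampling temperature forwarded to the model
    public int max_tokens;     // response length limit
    public string sha256;      // hex digest of the image bytes (assumed field name)
}

[Serializable]
public class LLaVAResponse
{
    // Parsed server output; only a reply-text field is implied by this README.
    public string text;
}

[Serializable]
public class ServerReply
{
    // Lightweight helper for the simplest replies (assumed shape).
    public string reply;
}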

🖼️ Image + Prompt Flow

  1. Capture or load a JPG image
  2. Construct a LlavaHeader with metadata
  3. Send header + image bytes over WebSocket
  4. Receive and parse the response
  5. Display result in UI and optionally speak it via TTS (see the flow sketch below)
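
A minimal sketch of the five steps above, assuming a plain System.Net.WebSockets.ClientWebSocket and System.Text.Json rather than whatever WebSocket plugin the actual QuestLLaVAClient MonoBehaviour uses. The frame layout (header as one text frame, JPG as one binary frame) and the sha256 field name are assumptions meant to illustrate the protocol, not the exact implementation.

using System;
using System.IO;
using System.Net.WebSockets;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class LlavaFlowSketch
{
    // Sends one header (text frame) followed by the raw JPG (binary frame)
    // and returns the server's JSON reply as a string.
    public static async Task<string> SendPromptWithImageAsync(Uri serverUri, string prompt, byte[] jpgBytes)
    {
        using var ws = new ClientWebSocket();
        await ws.ConnectAsync(serverUri, CancellationToken.None);

        // Step 2: build the header; field names mirror the JSON shown in the server section.
        string hash;
        using (var sha = SHA256.Create())
            hash = BitConverter.ToString(sha.ComputeHash(jpgBytes)).Replace("-", "");
        var header = new
        {
            prompt,
            image_len = jpgBytes.Length,
            temperature = 0.3,
            max_tokens = 256,
            sha256 = hash // assumed field name for the verification hash
        };
        byte[] headerBytes = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(header));

        // Step 3: header as a text frame, then the JPG bytes as a binary frame.
        await ws.SendAsync(new ArraySegment<byte>(headerBytes), WebSocketMessageType.Text, true, CancellationToken.None);
        await ws.SendAsync(new ArraySegment<byte>(jpgBytes), WebSocketMessageType.Binary, true, CancellationToken.None);

        // Step 4: read the (possibly fragmented) JSON reply.
        var buffer = new byte[64 * 1024];
        using var ms = new MemoryStream();
        WebSocketReceiveResult result;
        do
        {
            result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
            ms.Write(buffer, 0, result.Count);
        } while (!result.EndOfMessage);

        // Step 5: the caller displays this in the UI and optionally hands it to TTS.
        return Encoding.UTF8.GetString(ms.ToArray());
    }
}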

🛠️ Usage

  1. Attach QuestLLaVAClient to a Unity GameObject
  2. Assign TMP_InputField and TextMeshProUGUI for UI (optional)
  3. Set serverUri to your LLaVA server (e.g., ws://192.168.2.29:19111/llava)
  4. Call SendPrompt() or SendPromptWithImageBytes() to trigger interaction (a minimal hookup example follows this list)
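
A minimal MonoBehaviour sketch of that wiring. It assumes QuestLLaVAClient exposes serverUri as a public string field and the SendPrompt(string) method shown in the example prompt below; the Button reference, field names, and setting serverUri from code rather than the Inspector are all illustrative choices, not the project's required setup.

using UnityEngine;
using UnityEngine.UI;
using TMPro;

public class LlavaPromptButton : MonoBehaviour
{
    // Assign these in the Inspector.
    public QuestLLaVAClient client;      // the component described above
    public TMP_InputField promptInput;   // optional prompt field
    public Button askButton;             // hypothetical UI button

    void Awake()
    {
        // serverUri can also be set in the Inspector (step 3); assumed here to be a string field.
        client.serverUri = "ws://192.168.2.29:19111/llava";
        askButton.onClick.AddListener(OnAsk);
    }

    void OnAsk()
    {
        // Fall back to a default prompt when the input field is empty.
        string prompt = string.IsNullOrEmpty(promptInput.text)
            ? "Describe this image"
            : promptInput.text;
        client.SendPrompt(prompt);
    }
}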

📡 Ping Test

Use PingServer() to verify connectivity with your LLaVA server. This sends an HTTP GET to /ping and logs the result.
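
The internals of PingServer() aren't shown in this README, but a plain HTTP check against the same host could look like the following sketch; the assumption here is that the server exposes /ping over HTTP on the same port as the WebSocket endpoint.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class LlavaPing
{
    static readonly HttpClient http = new HttpClient { Timeout = TimeSpan.FromSeconds(3) };

    public static async Task<bool> PingAsync(string baseUrl = "http://192.168.2.29:19111")
    {
        try
        {
            // GET /ping and treat any 2xx answer as "server reachable".
            HttpResponseMessage resp = await http.GetAsync(baseUrl + "/ping");
            return resp.IsSuccessStatusCode;
        }
        catch (HttpRequestException)
        {
            return false; // server unreachable
        }
        catch (TaskCanceledException)
        {
            return false; // timed out
        }
    }
}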


🗣️ TTS Integration

If AndroidTTS is assigned in the inspector, responses will be spoken aloud. Toggle showRawJsonInUI to control whether full JSON or parsed text is shown.


🧪 Example Prompt

client.SendPrompt("Describe this image");
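
And a hedged example of the image variant, assuming SendPromptWithImageBytes takes the prompt plus raw JPG bytes (its exact signature isn't shown in this README). GetLatestCameraFrame() is a hypothetical placeholder for however the passthrough camera frame is obtained.

// Encode a captured frame to JPG and send it alongside the prompt.
Texture2D frame = GetLatestCameraFrame();            // hypothetical capture helper
byte[] jpg = ImageConversion.EncodeToJPG(frame, 75); // Unity's built-in JPG encoder
client.SendPromptWithImageBytes("Describe this image", jpg);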

📦 Requirements

  • Unity 2021+
  • TextMeshPro
  • AndroidTTS (optional)
  • LLaVA server running with WebSocket support

📜 License

MIT. Feel free to modify and extend for your own projects.

🙌 Credits

Inspired by the LLaVA project and designed for real-time multimodal interaction in Unity environments.



Scene 1

(screenshot)

Scene 2

(screenshot)

Scene 3

(screenshot)

🧠 LLaVA WebSocket Server with KokoroSharp TTS

This project implements a lightweight WebSocket server in C# that accepts multimodal prompts (text + optional image), forwards them to a local LLM (e.g., llama.cpp or LLaVA), and optionally returns synthesized speech using KokoroSharp — a fast, local text-to-speech engine.


🚀 Features

  • 🌐 WebSocket endpoint /llava for real-time prompt exchange
  • 🖼️ Supports image + text input via binary WebSocket frames
  • 🔁 Forwards prompts to a local LLM server (/v1/chat/completions)
  • 🔊 Optional TTS synthesis via KokoroSharp or macOS say
  • 📦 HTTP endpoint /tts for direct text-to-speech conversion
  • 🔔 /ping and /beep endpoints for diagnostics and testing
  • 🧠 Uses reflection to dynamically load KokoroSharp without hard dependencies

📡 WebSocket Flow (/llava)

  1. Client connects and sends a JSON header:
    {
      "prompt": "Describe this image",
      "image_len": 123456,
      "temperature": 0.3,
      "max_tokens": 256
    }
  2. If image_len > 0, the JPG bytes follow as binary WebSocket frames.
  3. The server forwards the prompt (and image) to the local LLM at /v1/chat/completions.
  4. The reply text, and optionally synthesized speech, is sent back to the client (see the server sketch below).
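
A minimal sketch of that /llava handler, assuming an HttpListener-based server and an OpenAI-compatible /v1/chat/completions endpoint on the local LLM. The LLM port, the request body shape, and the image handling are simplified assumptions; the actual project also loads KokoroSharp via reflection and exposes /tts and /beep, which are omitted here.

using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class LlavaServerSketch
{
    // Local LLM endpoint; the port is an assumption.
    static readonly HttpClient llm = new HttpClient { BaseAddress = new Uri("http://127.0.0.1:8080") };

    public static async Task RunAsync()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://+:19111/");
        listener.Start();

        while (true)
        {
            HttpListenerContext ctx = await listener.GetContextAsync();
            if (ctx.Request.Url.AbsolutePath == "/llava" && ctx.Request.IsWebSocketRequest)
                _ = HandleLlavaAsync(await ctx.AcceptWebSocketAsync(null)); // one task per client
            else if (ctx.Request.Url.AbsolutePath == "/ping")
                ctx.Response.Close(Encoding.UTF8.GetBytes("pong"), false);
            else
                ctx.Response.Close();
        }
    }

    static async Task HandleLlavaAsync(HttpListenerWebSocketContext wsCtx)
    {
        WebSocket ws = wsCtx.WebSocket;

        // 1. JSON header arrives as a text frame.
        var (headerBytes, _) = await ReceiveMessageAsync(ws);
        using JsonDocument header = JsonDocument.Parse(headerBytes);
        string prompt = header.RootElement.GetProperty("prompt").GetString();
        int imageLen = header.RootElement.GetProperty("image_len").GetInt32();

        // 2. Optional image bytes arrive as a binary frame.
        byte[] image = imageLen > 0 ? (await ReceiveMessageAsync(ws)).data : Array.Empty<byte>();

        // 3. Forward to the local LLM (OpenAI-style chat completion; image handling omitted here).
        var body = new { messages = new[] { new { role = "user", content = prompt } }, max_tokens = 256 };
        HttpResponseMessage resp = await llm.PostAsync("/v1/chat/completions",
            new StringContent(JsonSerializer.Serialize(body), Encoding.UTF8, "application/json"));
        string reply = await resp.Content.ReadAsStringAsync();

        // 4. Return the reply text; KokoroSharp TTS synthesis would be layered on at this point.
        byte[] replyBytes = Encoding.UTF8.GetBytes(reply);
        await ws.SendAsync(new ArraySegment<byte>(replyBytes), WebSocketMessageType.Text, true, CancellationToken.None);
        await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", CancellationToken.None);
    }

    // Reads one complete WebSocket message, reassembling fragmented frames.
    static async Task<(byte[] data, WebSocketMessageType type)> ReceiveMessageAsync(WebSocket ws)
    {
        var buffer = new byte[64 * 1024];
        using var ms = new MemoryStream();
        WebSocketReceiveResult result;
        do
        {
            result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
            ms.Write(buffer, 0, result.Count);
        } while (!result.EndOfMessage);
        return (ms.ToArray(), result.MessageType);
    }
}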

Improvements

  1. Fix turn/click prompting using the Meta Quest controller joystick.
  2. Fix on-device TTS.
  3. Add speech commands (on-device ASR).

📜 Attribution

Portions of this project’s camera overlay logic were adapted from Meta’s PassthroughCameraApiSamples, specifically the CameraToWorldManager component included in the Meta XR PCSA Mixed Reality Starter Samples. Original samples © Meta Platforms, Inc. and affiliates. Used under the terms of the Meta SDK license. All other code, including the QuestLLaVAClient and multimodal integration, is original and authored by Gildas.
