# Universal Avatar Engine
Real-time lip-sync avatar system that works with **any voice source**. Upload a face, connect an audio stream, watch it speak.
---
## ✨ Features
- **Universal**: Works with Claude, ChatGPT, Grok, local models, TTS engines, audio files
- **Real-time**: Low-latency lip-sync via direct frequency analysis (no transcription round trip)
- **Realistic**: WebGL mesh deformation, not layered overlays
- **Accurate**: MediaPipe face mesh with 468 landmarks
- **Smooth**: Spring physics for natural movement
- **10 visemes**: Full phoneme coverage (ah, ee, oh, oo, er, f, m, l, w, th)
---
## 🏗️ Architecture
```
┌─────────────────┐
│  Upload Image   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│  FaceMeshAnalyzer       │ ← MediaPipe Face Mesh
│  - Detect landmarks     │
│  - Extract mouth region │
│  - Build deform rig     │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  MeshDeformer (WebGL)   │ ← Mesh warping
│  - Load face texture    │
│  - Create vertex grid   │
│  - Apply deformations   │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Audio Stream           │ ← Any source
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  PhonemeDetector        │ ← Frequency analysis
│  - Analyze spectrum     │
│  - Detect visemes       │
│  - Track volume         │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Spring Physics         │ ← Smooth interpolation
│  - Target viseme        │
│  - Velocity damping     │
│  - Natural motion       │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Render Loop            │ ← 60 FPS animation
│  - Update deformation   │
│  - Render WebGL mesh    │
└─────────────────────────┘
```
---
## 🚀 Quick Start
### Installation
```bash
npm install @mediapipe/tasks-vision
```
### Basic Usage
```tsx
import { useState } from 'react';
import AvatarEngine from './AvatarEngine';

function App() {
  const [audioStream, setAudioStream] = useState<MediaStream | null>(null);

  return (
    <AvatarEngine
      imageUrl="/path/to/face.jpg"
      audioStream={audioStream}
      isPlaying={true}
      intensity={0.8}
      debug={false}
    />
  );
}
```
---
## 🔌 Integration Examples
### Claude Voice Mode
```tsx
// Capture Claude's voice output. getUserMedia captures the microphone,
// so this relies on loopback; tab/system audio capture also works.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

<AvatarEngine
  imageUrl="/claude-face.jpg"
  audioStream={stream}
  isPlaying={claudeIsSpeaking}
/>
```
### ChatGPT Voice
```tsx
// Same approach - just point to ChatGPT's audio stream
// (getChatGPTAudioStream is a placeholder for your own audio plumbing)
const chatGPTStream = getChatGPTAudioStream();

<AvatarEngine
  imageUrl="/chatgpt-face.jpg"
  audioStream={chatGPTStream}
  isPlaying={chatGPTIsSpeaking}
/>
```
### Local TTS (Web Speech API)
```tsx
// Caveat: browsers play speechSynthesis output directly to the system
// output and do not expose it as a routable audio node, so it cannot be
// piped into a MediaStreamDestination. A simple workaround is microphone
// loopback; alternatively, use a TTS engine that returns raw audio.
const utterance = new SpeechSynthesisUtterance("Hello world");
speechSynthesis.speak(utterance);

// Loopback capture while the utterance plays through the speakers:
const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });

<AvatarEngine
  imageUrl="/avatar.jpg"
  audioStream={micStream}
  isPlaying={speechSynthesis.speaking}
/>
```
### Audio File
```tsx
const audioElement = new Audio('/speech.mp3');
const audioContext = new AudioContext();
const source = audioContext.createMediaElementSource(audioElement);
const destination = audioContext.createMediaStreamDestination();
source.connect(destination);
source.connect(audioContext.destination); // Also play through speakers
audioElement.play();

// In a real component, mirror play state into React state so the prop updates
<AvatarEngine
  imageUrl="/face.jpg"
  audioStream={destination.stream}
  isPlaying={!audioElement.paused}
/>
```
---
## 🎨 How It Works
### 1. Face Analysis (One-time)
When you upload an image:
1. MediaPipe detects 468 facial landmarks
2. Mouth region extracted (upper lip, lower lip, corners, jaw)
3. WebGL mesh created with deformable vertices
4. Result cached for real-time animation
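For orientation, here is a minimal sketch of the landmark-detection step with `@mediapipe/tasks-vision` (the CDN and model URLs are Google's published ones; option values are illustrative, and `FaceMeshAnalyzer.ts` wraps this internally):
```ts
import { FaceLandmarker, FilesetResolver } from '@mediapipe/tasks-vision';

async function detectLandmarks(image: HTMLImageElement) {
  // Load the WASM runtime, then the face landmarker model
  const vision = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm'
  );
  const landmarker = await FaceLandmarker.createFromOptions(vision, {
    baseOptions: {
      modelAssetPath:
        'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task',
    },
    runningMode: 'IMAGE', // one-shot analysis of a still image
    numFaces: 1,
  });

  // faceLandmarks[0] holds the normalized { x, y, z } mesh points
  return landmarker.detect(image).faceLandmarks[0];
}
```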
### 2. Audio → Viseme Pipeline (Real-time)
Every frame (~16ms):
1. **Frequency analysis**: FFT → 4 frequency bands
2. **Viseme detection**: Pattern matching on band ratios
3. **Spring physics**: Smooth interpolation to target shape
4. **Mesh deformation**: Vertices displaced based on viseme
5. **WebGL render**: Warped face drawn to canvas
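A sketch of how those five steps chain together each frame (class names follow the architecture diagram, but the method names are assumptions; the real loop lives in `AvatarEngine.tsx`):
```ts
// Illustrative glue code only; method names are assumed, not the real API.
function animate(detector: PhonemeDetector, deformer: MeshDeformer) {
  const { viseme, volume } = detector.analyze(); // FFT -> viseme + volume
  deformer.setTarget(viseme, volume);            // set spring targets
  deformer.update();                             // integrate spring physics
  deformer.render();                             // draw the warped mesh
  requestAnimationFrame(() => animate(detector, deformer));
}
```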
### 3. Phoneme → Viseme Mapping
| Phoneme | Example | Jaw | Lips | Shape |
|---------|---------|-----|------|-------|
| AH | "father" | Open | Neutral | Wide opening |
| EE | "see" | Closed | Spread | Horizontal stretch |
| OH | "go" | Mid | Round | Moderate round |
| OO | "boot" | Closed | Tight | Pucker |
| ER | "bird" | Mid | Neutral | R-colored |
| F/V | "five" | Closed | Contact | Upper teeth on lower lip |
| M/B/P | "mom" | Closed | Closed | Bilabial closure |
| L/D/T | "let" | Open | Spread | Tongue visible |
| W | "we" | Closed | Round | Rounded forward |
| TH | "the" | Open | Spread | Dental |
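In code, the table collapses to a small union type (a sketch; the real definition lives in `PhonemeDetector.ts`):
```ts
// The ten viseme classes from the table above. A separate rest/silence
// pose (driven by volume) is a plausible addition, but this README
// does not spell one out.
type Viseme = 'ah' | 'ee' | 'oh' | 'oo' | 'er' | 'f' | 'm' | 'l' | 'w' | 'th';
```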
---
## 🎛️ Configuration
### Props
```tsx
interface AvatarEngineProps {
  imageUrl: string;                         // Face image URL
  audioStream: MediaStream | null;          // Audio source
  isPlaying: boolean;                       // Voice activity flag
  intensity?: number;                       // Animation intensity (0-1, default 0.8)
  debug?: boolean;                          // Show debug overlay
  onFaceAnalyzed?: (rig: FaceRig) => void;  // Face analysis callback
  onError?: (error: Error) => void;         // Error handler
}
```
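`FaceRig` is not defined in this README; a plausible shape, purely for orientation (every field name here is an assumption):
```ts
// Hypothetical sketch of the rig handed to onFaceAnalyzed.
interface FaceRig {
  landmarks: { x: number; y: number; z: number }[]; // MediaPipe mesh points
  mouth: {
    upperLip: number[]; // landmark indices into `landmarks`
    lowerLip: number[];
    corners: number[];
    jaw: number[];
  };
}
```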
### Tuning
**Spring physics** (in `MeshDeformer.ts`):
```ts
const springStrength = 0.18; // Higher = snappier (0.1 - 0.3)
const damping = 0.7; // Higher = less bouncy (0.5 - 0.9)
```
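Both constants typically feed a damped-spring step each frame; one common formulation consistent with the comments above (a sketch, not the exact code in `MeshDeformer.ts`):
```ts
// Per-frame spring integration toward the target viseme weight.
velocity += (target - current) * springStrength; // pull toward the target
velocity *= 1 - damping;                         // higher damping sheds more energy
current += velocity;                             // advance the animated weight
```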
**Viseme deformations** (in `MeshDeformer.ts`):
```ts
const VISEME_DEFORMATIONS: Record<Viseme, VisemeDeformation> = {
  ah: {
    jawDrop: 0.7,    // 0-1: How far jaw drops
    lipSpread: 0.3,  // 0-1: Horizontal stretch
    // ... etc
  },
};
```
**Phoneme detection** (in `PhonemeDetector.ts`):
```ts
// Adjust frequency band thresholds
if (high > 0.35 && mid > 0.30 && low < 0.20) {
  return 'ee'; // Tune these values for accuracy
}
```
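The `low` / `mid` / `high` values are normalized band energies taken from a Web Audio `AnalyserNode`; a self-contained sketch of one way to derive them (bin ranges are illustrative, not necessarily those in `PhonemeDetector.ts`):
```ts
const ctx = new AudioContext();
const analyser = ctx.createAnalyser();
analyser.fftSize = 1024; // 512 frequency bins
const bins = new Uint8Array(analyser.frequencyBinCount);
// Feed it from any MediaStream:
// ctx.createMediaStreamSource(stream).connect(analyser);

function bandEnergies() {
  analyser.getByteFrequencyData(bins);
  // Average bin magnitude over a range, normalized to 0..1
  const avg = (from: number, to: number) => {
    let sum = 0;
    for (let i = from; i < to; i++) sum += bins[i];
    return sum / ((to - from) * 255);
  };
  return {
    low: avg(0, 10),    // roughly 0-470 Hz at a 48 kHz sample rate
    mid: avg(10, 40),   // roughly 470-1900 Hz
    high: avg(40, 120), // roughly 1900-5600 Hz
  };
}
```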
---
## 🧪 Testing
### Test with different faces:
- Front-facing portraits work best
- Clear lighting, neutral expression
- Face should fill ~60-80% of frame
- JPG/PNG, reasonable resolution (500-2000px)
### Test with different audio:
- Male vs female voices (different frequency ranges)
- Fast vs slow speech (spring physics handles both)
- Whisper vs shout (volume scaling)
- Different languages (the viseme classes generalize across languages)
### Debug mode:
```tsx
<AvatarEngine debug={true} />
```
Shows:
- Current viseme
- Volume level
- Face landmarks count
- Frame timing
---
## 🐛 Troubleshooting
### "No face detected"
- Ensure face is front-facing
- Check image quality
- Try different lighting
- Face must be clearly visible
### Mouth doesn't move
- Verify `audioStream` is connected
- Check `isPlaying` is true
- Enable debug mode to see viseme detection
- Confirm audio has sufficient volume
### Movement looks unnatural
- Decrease `intensity` (try 0.5 - 0.7)
- Adjust spring physics constants
- Check if face landmarks are accurate
- Try different deformation parameters
### Performance issues
- Reduce canvas size
- Lower mesh grid resolution (in `createMeshGeometry`; see the sketch after this list)
- Disable debug overlay
- Use smaller source images
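On the mesh-resolution point: a hypothetical sketch of what a resolution knob in `createMeshGeometry` could look like (the real signature may differ). Fewer columns and rows means fewer vertices to displace per frame:
```ts
// Hypothetical: generate a cols x rows grid of normalized vertex positions.
function createMeshGeometry(cols = 32, rows = 32): Float32Array {
  const verts = new Float32Array(cols * rows * 2); // packed x,y pairs in 0..1
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      const i = (r * cols + c) * 2;
      verts[i] = c / (cols - 1);
      verts[i + 1] = r / (rows - 1);
    }
  }
  return verts;
}
```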
---
## 🔮 Future Enhancements
### Accuracy improvements:
- [ ] ML-based phoneme detection (Whisper, Wav2Vec)
- [ ] Co-articulation modeling (phoneme transitions)
- [ ] Prosody detection (emphasis, emotion)
### Visual improvements:
- [ ] Eye blinks synced to speech pauses
- [ ] Head motion (nods, turns)
- [ ] Facial expressions (smile, frown)
- [ ] Eyebrow movement
### Technical improvements:
- [ ] GPU acceleration for face detection
- [ ] Web Worker for audio analysis
- [ ] Cached deformation meshes
- [ ] Progressive face loading
---
## 📦 File Structure
```
/
├── FaceMeshAnalyzer.ts    # MediaPipe face detection
├── PhonemeDetector.ts     # Audio → viseme mapping
├── MeshDeformer.ts        # WebGL mesh warping
├── AvatarEngine.tsx       # Main React component
├── AvatarEngineDemo.tsx   # Example usage
└── README.md              # This file
```
---
## 🎯 Use Cases
- **Voice assistants**: Visualize AI responses
- **Video calls**: Animate static profile pics
- **Accessibility**: Visual speech cues for hearing-impaired users
- **Gaming**: NPC dialogue animation
- **Education**: Language learning pronunciation
- **Entertainment**: Deepfake-lite (ethical use only)
---
## 📄 License
MIT - Do whatever you want with it.
---
## 🙏 Credits
Built with:
- [MediaPipe Face Mesh](https://google.github.io/mediapipe/solutions/face_mesh.html) by Google
- Web Audio API
- WebGL
- React
---
**Made with 🔥 for universal avatar animation**