
# Universal Avatar Engine

Real-time lip-sync avatar system that works with **any voice source**. Upload a face, connect an audio stream, watch it speak.

---

## ✨ Features

- **Universal**: Works with Claude, ChatGPT, Grok, local models, TTS engines, audio files
- **Real-time**: Low-latency lip-sync via frequency analysis
- **Realistic**: WebGL mesh deformation, not layered overlays
- **Accurate**: MediaPipe face mesh with 468 landmarks
- **Smooth**: Spring physics for natural movement
- **10 visemes**: Full phoneme coverage (ah, ee, oh, oo, er, f, m, l, w, th)

---

## ๐Ÿ—๏ธ Architecture

```
┌─────────────────┐
│  Upload Image   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│  FaceMeshAnalyzer       │  ← MediaPipe Face Mesh
│  - Detect landmarks     │
│  - Extract mouth region │
│  - Build deform rig     │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  MeshDeformer (WebGL)   │  ← Mesh warping
│  - Load face texture    │
│  - Create vertex grid   │
│  - Apply deformations   │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Audio Stream           │  ← Any source
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  PhonemeDetector        │  ← Frequency analysis
│  - Analyze spectrum     │
│  - Detect visemes       │
│  - Track volume         │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Spring Physics         │  ← Smooth interpolation
│  - Target viseme        │
│  - Velocity damping     │
│  - Natural motion       │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Render Loop            │  ← 60 FPS animation
│  - Update deformation   │
│  - Render WebGL mesh    │
└─────────────────────────┘
```

---

## 🚀 Quick Start

### Installation

```bash
npm install @mediapipe/tasks-vision
```

### Basic Usage

```tsx
import { useState } from 'react';
import AvatarEngine from './AvatarEngine';

function App() {
  const [audioStream, setAudioStream] = useState<MediaStream | null>(null);
  
  return (
    <AvatarEngine
      imageUrl="/path/to/face.jpg"
      audioStream={audioStream}
      isPlaying={true}
      intensity={0.8}
      debug={false}
    />
  );
}
```

---

## 🔌 Integration Examples

### Claude Voice Mode

```tsx
// getUserMedia captures the microphone, which picks up Claude's voice
// when it plays through speakers (or route it in via a loopback device)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

<AvatarEngine
  imageUrl="/claude-face.jpg"
  audioStream={stream}
  isPlaying={claudeIsSpeaking}
/>
```

### ChatGPT Voice

```tsx
// Same approach - getChatGPTAudioStream() is a placeholder for however
// your app exposes ChatGPT's audio as a MediaStream
const chatGPTStream = getChatGPTAudioStream();

<AvatarEngine
  imageUrl="/chatgpt-face.jpg"
  audioStream={chatGPTStream}
  isPlaying={chatGPTIsSpeaking}
/>
```

### Local TTS (Web Speech API)

```tsx
// Caveat: the Web Speech API does not expose speechSynthesis output to
// Web Audio, so it cannot be routed into a MediaStreamDestination directly.
// One workaround is to capture tab audio while the utterance plays:
const captureStream = await navigator.mediaDevices.getDisplayMedia({
  video: true, // most browsers require video when capturing tab audio
  audio: true,
});

const utterance = new SpeechSynthesisUtterance("Hello world");
speechSynthesis.speak(utterance);

<AvatarEngine
  imageUrl="/avatar.jpg"
  audioStream={captureStream}
  isPlaying={true}
/>
```

### Audio File

```tsx
const audioElement = new Audio('/speech.mp3');
const audioContext = new AudioContext();
const source = audioContext.createMediaElementSource(audioElement);
const destination = audioContext.createMediaStreamDestination();

source.connect(destination);
source.connect(audioContext.destination); // Also play through speakers

await audioContext.resume(); // AudioContext starts suspended until a user gesture
audioElement.play();

<AvatarEngine
  imageUrl="/face.jpg"
  audioStream={destination.stream}
  isPlaying={!audioElement.paused}
/>
```

---

## 🎨 How It Works

### 1. Face Analysis (One-time)

When you upload an image:
1. MediaPipe detects 468 facial landmarks
2. Mouth region extracted (upper lip, lower lip, corners, jaw)
3. WebGL mesh created with deformable vertices
4. Result cached for real-time animation
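
As a rough sketch of this one-time step using `@mediapipe/tasks-vision` (the WASM path, model URL, and option values below are illustrative, not necessarily what `FaceMeshAnalyzer.ts` uses):

```ts
import { FilesetResolver, FaceLandmarker } from '@mediapipe/tasks-vision';

// Load the WASM runtime and a face landmarker model (URLs are examples)
const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
);
const landmarker = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath:
      'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task',
  },
  runningMode: 'IMAGE',
  numFaces: 1,
});

// One-time analysis of the uploaded image
const image = new Image();
image.src = '/path/to/face.jpg';
await image.decode();
const result = landmarker.detect(image);
// result.faceLandmarks[0] holds the 468 normalized landmarks
```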

### 2. Audio → Viseme Pipeline (Real-time)

Every frame (~16ms):
1. **Frequency analysis**: FFT → 4 frequency bands
2. **Viseme detection**: Pattern matching on band ratios
3. **Spring physics**: Smooth interpolation to target shape
4. **Mesh deformation**: Vertices displaced based on viseme
5. **WebGL render**: Warped face drawn to canvas
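
A minimal sketch of steps 1–2, assuming an `AnalyserNode`-based FFT; the band boundaries and normalization here are placeholders:

```ts
const audioContext = new AudioContext();
const analyser = audioContext.createAnalyser();
analyser.fftSize = 256; // 128 frequency bins

// stream is the MediaStream passed to <AvatarEngine>
const source = audioContext.createMediaStreamSource(stream);
source.connect(analyser);

const bins = new Uint8Array(analyser.frequencyBinCount);

function analyzeFrame() {
  analyser.getByteFrequencyData(bins);

  // Collapse the spectrum into 4 coarse bands, each normalized to 0-1
  const bandSize = bins.length / 4;
  const bands = [0, 1, 2, 3].map((i) => {
    let sum = 0;
    for (let j = i * bandSize; j < (i + 1) * bandSize; j++) sum += bins[j];
    return sum / (bandSize * 255);
  });

  // bands feed the viseme pattern matcher, then spring physics
  requestAnimationFrame(analyzeFrame);
}
requestAnimationFrame(analyzeFrame);
```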

### 3. Phoneme → Viseme Mapping

| Phoneme | Example | Jaw | Lips | Shape |
|---------|---------|-----|------|-------|
| AH | "father" | Open | Neutral | Wide opening |
| EE | "see" | Closed | Spread | Horizontal stretch |
| OH | "go" | Mid | Round | Moderate round |
| OO | "boot" | Closed | Tight | Pucker |
| ER | "bird" | Mid | Neutral | R-colored |
| F/V | "five" | Closed | Contact | Upper teeth on lower lip |
| M/B/P | "mom" | Closed | Closed | Bilabial closure |
| L/D/T | "let" | Open | Spread | Tongue visible |
| W | "we" | Closed | Round | Rounded forward |
| TH | "the" | Open | Spread | Dental |
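
In code, this table corresponds to a ten-way union like the one keying `VISEME_DEFORMATIONS` in `MeshDeformer.ts` (the exact definition there may differ):

```ts
// The ten visemes from the table above
type Viseme = 'ah' | 'ee' | 'oh' | 'oo' | 'er' | 'f' | 'm' | 'l' | 'w' | 'th';
```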

---

## 🎛️ Configuration

### Props

```tsx
interface AvatarEngineProps {
  imageUrl: string;              // Face image URL
  audioStream: MediaStream | null;  // Audio source
  isPlaying: boolean;             // Voice activity flag
  intensity?: number;             // Animation intensity (0-1, default 0.8)
  debug?: boolean;                // Show debug overlay
  onFaceAnalyzed?: (rig: FaceRig) => void;  // Face analysis callback
  onError?: (error: Error) => void;  // Error handler
}
```
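
For example, wiring the optional callbacks (handler bodies here are placeholders):

```tsx
<AvatarEngine
  imageUrl="/face.jpg"
  audioStream={stream}
  isPlaying={speaking}
  intensity={0.6}
  onFaceAnalyzed={(rig) => console.log('face rig ready', rig)}
  onError={(err) => console.error('avatar error:', err)}
/>
```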

### Tuning

**Spring physics** (in `MeshDeformer.ts`):
```ts
const springStrength = 0.18;  // Higher = snappier (0.1 - 0.3)
const damping = 0.7;          // Higher = less bouncy (0.5 - 0.9)
```
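
For context, a minimal per-frame update these constants would typically drive (the actual integration in `MeshDeformer.ts` may differ in detail):

```ts
let position = 0; // current blend weight toward the target viseme shape
let velocity = 0;

function springStep(target: number, springStrength = 0.18, damping = 0.7): number {
  velocity += (target - position) * springStrength; // pull toward target
  velocity *= damping;                              // bleed off energy
  position += velocity;
  return position; // call once per frame
}
```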

**Viseme deformations** (in `MeshDeformer.ts`):
```ts
const VISEME_DEFORMATIONS: Record<Viseme, VisemeDeformation> = {
  ah: { 
    jawDrop: 0.7,      // 0-1: How far jaw drops
    lipSpread: 0.3,    // 0-1: Horizontal stretch
    // ... etc
  }
}
```

**Phoneme detection** (in `PhonemeDetector.ts`):
```ts
// Adjust frequency band thresholds
if (high > 0.35 && mid > 0.30 && low < 0.20) {
  return 'ee';  // Tune these values for accuracy
}
```
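
Taken together, a hypothetical classifier in this style might look like the following (`Viseme` is the union from section 3; bands normalized 0–1, all thresholds illustrative):

```ts
function classifyViseme(
  low: number, mid: number, high: number, volume: number
): Viseme | null {
  if (volume < 0.05) return null;                           // silence: close mouth
  if (high > 0.35 && mid > 0.30 && low < 0.20) return 'ee'; // bright, spread
  if (low > 0.40 && high < 0.15) return 'oo';               // dark, rounded
  if (low > 0.30 && mid > 0.30) return 'ah';                // broad energy: open
  return 'oh';                                              // mid-round fallback
}
```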

---

## 🧪 Testing

### Test with different faces:
- Front-facing portraits work best
- Clear lighting, neutral expression
- Face should fill ~60-80% of frame
- JPG/PNG, reasonable resolution (500-2000px)

### Test with different audio:
- Male vs female voices (different frequency ranges)
- Fast vs slow speech (spring physics handles both)
- Whisper vs shout (volume scaling)
- Different languages (the viseme set covers mouth shapes common across languages)

### Debug mode:
```tsx
<AvatarEngine debug={true} />
```
Shows:
- Current viseme
- Volume level
- Face landmarks count
- Frame timing

---

## ๐Ÿ› Troubleshooting

### "No face detected"
- Ensure face is front-facing
- Check image quality
- Try different lighting
- Face must be clearly visible

### Mouth doesn't move
- Verify `audioStream` is connected
- Check `isPlaying` is true
- Enable debug mode to see viseme detection
- Confirm audio has sufficient volume

### Movement looks unnatural
- Decrease `intensity` (try 0.5 - 0.7)
- Adjust spring physics constants
- Check if face landmarks are accurate
- Try different deformation parameters

### Performance issues
- Reduce canvas size
- Lower mesh grid resolution (in `createMeshGeometry`)
- Disable debug overlay
- Use smaller source images

---

## 🔮 Future Enhancements

### Accuracy improvements:
- [ ] ML-based phoneme detection (Whisper, Wav2Vec)
- [ ] Co-articulation modeling (phoneme transitions)
- [ ] Prosody detection (emphasis, emotion)

### Visual improvements:
- [ ] Eye blinks synced to speech pauses
- [ ] Head motion (nods, turns)
- [ ] Facial expressions (smile, frown)
- [ ] Eyebrow movement

### Technical improvements:
- [ ] GPU acceleration for face detection
- [ ] Web Worker for audio analysis
- [ ] Cached deformation meshes
- [ ] Progressive face loading

---

## 📦 File Structure

```
/
├── FaceMeshAnalyzer.ts      # MediaPipe face detection
├── PhonemeDetector.ts       # Audio → viseme mapping
├── MeshDeformer.ts          # WebGL mesh warping
├── AvatarEngine.tsx         # Main React component
├── AvatarEngineDemo.tsx     # Example usage
└── README.md                # This file
```

---

## 🎯 Use Cases

- **Voice assistants**: Visualize AI responses
- **Video calls**: Animate static profile pics
- **Accessibility**: Visual speech cues for deaf and hard-of-hearing users
- **Gaming**: NPC dialogue animation
- **Education**: Language learning pronunciation
- **Entertainment**: Deepfake-lite (ethical use only)

---

## 📝 License

MIT - Do whatever you want with it.

---

## 🙏 Credits

Built with:
- [MediaPipe Face Mesh](https://google.github.io/mediapipe/solutions/face_mesh.html) by Google
- Web Audio API
- WebGL
- React

---

**Made with 🔥 for universal avatar animation**
