
# Universal Avatar Engine

Real-time lip-sync avatar system that works with **any voice source**. Upload a face, connect an audio stream, watch it speak.

---

## ✨ Features

- **Universal**: Works with Claude, ChatGPT, Grok, local models, TTS engines, audio files
- **Real-time**: Low-latency lip-sync via frequency analysis
- **Realistic**: WebGL mesh deformation, not layered overlays
- **Accurate**: MediaPipe face mesh with 468 landmarks
- **Smooth**: Spring physics for natural movement
- **10 visemes**: Full phoneme coverage (ah, ee, oh, oo, er, f, m, l, w, th)

---

## ๐Ÿ—๏ธ Architecture

```
┌─────────────────┐
│  Upload Image   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│  FaceMeshAnalyzer       │  ← MediaPipe Face Mesh
│  - Detect landmarks     │
│  - Extract mouth region │
│  - Build deform rig     │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  MeshDeformer (WebGL)   │  ← Mesh warping
│  - Load face texture    │
│  - Create vertex grid   │
│  - Apply deformations   │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Audio Stream           │  ← Any source
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  PhonemeDetector        │  ← Frequency analysis
│  - Analyze spectrum     │
│  - Detect visemes       │
│  - Track volume         │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Spring Physics         │  ← Smooth interpolation
│  - Target viseme        │
│  - Velocity damping     │
│  - Natural motion       │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Render Loop            │  ← 60 FPS animation
│  - Update deformation   │
│  - Render WebGL mesh    │
└─────────────────────────┘
```

---

## 🚀 Quick Start

### Installation

```bash
npm install @mediapipe/tasks-vision
```

### Basic Usage

```tsx
import { useState } from 'react';
import AvatarEngine from './AvatarEngine';

function App() {
  const [audioStream, setAudioStream] = useState<MediaStream | null>(null);
  
  return (
    <AvatarEngine
      imageUrl="/path/to/face.jpg"
      audioStream={audioStream}
      isPlaying={true}
      intensity={0.8}
      debug={false}
    />
  );
}
```

---

## 🔌 Integration Examples

### Claude Voice Mode

```tsx
// getUserMedia captures the microphone, which picks up Claude's voice
// when it plays through speakers (or route it in via a loopback device)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

<AvatarEngine
  imageUrl="/claude-face.jpg"
  audioStream={stream}
  isPlaying={claudeIsSpeaking}
/>
```

### ChatGPT Voice

```tsx
// Same approach - getChatGPTAudioStream() is a placeholder for however
// your app exposes ChatGPT's audio as a MediaStream
const chatGPTStream = getChatGPTAudioStream();

<AvatarEngine
  imageUrl="/chatgpt-face.jpg"
  audioStream={chatGPTStream}
  isPlaying={chatGPTIsSpeaking}
/>
```

### Local TTS (Web Speech API)

```tsx
// Caveat: the Web Speech API does not expose speechSynthesis output to
// Web Audio, so it cannot be routed into a MediaStreamDestination directly.
// One workaround is to capture tab audio while the utterance plays:
const captureStream = await navigator.mediaDevices.getDisplayMedia({
  video: true, // most browsers require video when capturing tab audio
  audio: true,
});

const utterance = new SpeechSynthesisUtterance("Hello world");
speechSynthesis.speak(utterance);

<AvatarEngine
  imageUrl="/avatar.jpg"
  audioStream={captureStream}
  isPlaying={true}
/>
```

### Audio File

```tsx
const audioElement = new Audio('/speech.mp3');
const audioContext = new AudioContext();
const source = audioContext.createMediaElementSource(audioElement);
const destination = audioContext.createMediaStreamDestination();

source.connect(destination);
source.connect(audioContext.destination); // Also play through speakers

await audioContext.resume(); // AudioContext starts suspended until a user gesture
audioElement.play();

<AvatarEngine
  imageUrl="/face.jpg"
  audioStream={destination.stream}
  isPlaying={!audioElement.paused}
/>
```

---

## 🎨 How It Works

### 1. Face Analysis (One-time)

When you upload an image:
1. MediaPipe detects 468 facial landmarks
2. Mouth region extracted (upper lip, lower lip, corners, jaw)
3. WebGL mesh created with deformable vertices
4. Result cached for real-time animation
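
As a rough sketch of this one-time step using `@mediapipe/tasks-vision` (the WASM path, model URL, and option values below are illustrative, not necessarily what `FaceMeshAnalyzer.ts` uses):

```ts
import { FilesetResolver, FaceLandmarker } from '@mediapipe/tasks-vision';

// Load the WASM runtime and a face landmarker model (URLs are examples)
const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
);
const landmarker = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath:
      'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task',
  },
  runningMode: 'IMAGE',
  numFaces: 1,
});

// One-time analysis of the uploaded image
const image = new Image();
image.src = '/path/to/face.jpg';
await image.decode();
const result = landmarker.detect(image);
// result.faceLandmarks[0] holds the 468 normalized landmarks
```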

### 2. Audio → Viseme Pipeline (Real-time)

Every frame (~16ms):
1. **Frequency analysis**: FFT → 4 frequency bands
2. **Viseme detection**: Pattern matching on band ratios
3. **Spring physics**: Smooth interpolation to target shape
4. **Mesh deformation**: Vertices displaced based on viseme
5. **WebGL render**: Warped face drawn to canvas
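
A minimal sketch of steps 1–2, assuming an `AnalyserNode`-based FFT; the band boundaries and normalization here are placeholders:

```ts
const audioContext = new AudioContext();
const analyser = audioContext.createAnalyser();
analyser.fftSize = 256; // 128 frequency bins

// stream is the MediaStream passed to <AvatarEngine>
const source = audioContext.createMediaStreamSource(stream);
source.connect(analyser);

const bins = new Uint8Array(analyser.frequencyBinCount);

function analyzeFrame() {
  analyser.getByteFrequencyData(bins);

  // Collapse the spectrum into 4 coarse bands, each normalized to 0-1
  const bandSize = bins.length / 4;
  const bands = [0, 1, 2, 3].map((i) => {
    let sum = 0;
    for (let j = i * bandSize; j < (i + 1) * bandSize; j++) sum += bins[j];
    return sum / (bandSize * 255);
  });

  // bands feed the viseme pattern matcher, then spring physics
  requestAnimationFrame(analyzeFrame);
}
requestAnimationFrame(analyzeFrame);
```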

### 3. Phoneme → Viseme Mapping

| Phoneme | Example | Jaw | Lips | Shape |
|---------|---------|-----|------|-------|
| AH | "father" | Open | Neutral | Wide opening |
| EE | "see" | Closed | Spread | Horizontal stretch |
| OH | "go" | Mid | Round | Moderate round |
| OO | "boot" | Closed | Tight | Pucker |
| ER | "bird" | Mid | Neutral | R-colored |
| F/V | "five" | Closed | Contact | Upper teeth on lower lip |
| M/B/P | "mom" | Closed | Closed | Bilabial closure |
| L/D/T | "let" | Open | Spread | Tongue visible |
| W | "we" | Closed | Round | Rounded forward |
| TH | "the" | Open | Spread | Dental |
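
In code, this table corresponds to a ten-way union like the one keying `VISEME_DEFORMATIONS` in `MeshDeformer.ts` (the exact definition there may differ):

```ts
// The ten visemes from the table above
type Viseme = 'ah' | 'ee' | 'oh' | 'oo' | 'er' | 'f' | 'm' | 'l' | 'w' | 'th';
```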

---

## 🎛️ Configuration

### Props

```tsx
interface AvatarEngineProps {
  imageUrl: string;              // Face image URL
  audioStream: MediaStream | null;  // Audio source
  isPlaying: boolean;             // Voice activity flag
  intensity?: number;             // Animation intensity (0-1, default 0.8)
  debug?: boolean;                // Show debug overlay
  onFaceAnalyzed?: (rig: FaceRig) => void;  // Face analysis callback
  onError?: (error: Error) => void;  // Error handler
}
```
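
For example, wiring the optional callbacks (handler bodies here are placeholders):

```tsx
<AvatarEngine
  imageUrl="/face.jpg"
  audioStream={stream}
  isPlaying={speaking}
  intensity={0.6}
  onFaceAnalyzed={(rig) => console.log('face rig ready', rig)}
  onError={(err) => console.error('avatar error:', err)}
/>
```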

### Tuning

**Spring physics** (in `MeshDeformer.ts`):
```ts
const springStrength = 0.18;  // Higher = snappier (0.1 - 0.3)
const damping = 0.7;          // Higher = less bouncy (0.5 - 0.9)
```
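
For context, a minimal per-frame update these constants would typically drive (the actual integration in `MeshDeformer.ts` may differ in detail):

```ts
let position = 0; // current blend weight toward the target viseme shape
let velocity = 0;

function springStep(target: number, springStrength = 0.18, damping = 0.7): number {
  velocity += (target - position) * springStrength; // pull toward target
  velocity *= damping;                              // bleed off energy
  position += velocity;
  return position; // call once per frame
}
```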

**Viseme deformations** (in `MeshDeformer.ts`):
```ts
const VISEME_DEFORMATIONS: Record<Viseme, VisemeDeformation> = {
  ah: { 
    jawDrop: 0.7,      // 0-1: How far jaw drops
    lipSpread: 0.3,    // 0-1: Horizontal stretch
    // ... etc
  }
}
```

**Phoneme detection** (in `PhonemeDetector.ts`):
```ts
// Adjust frequency band thresholds
if (high > 0.35 && mid > 0.30 && low < 0.20) {
  return 'ee';  // Tune these values for accuracy
}
```
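
Taken together, a hypothetical classifier in this style might look like the following (`Viseme` is the union from section 3; bands normalized 0–1, all thresholds illustrative):

```ts
function classifyViseme(
  low: number, mid: number, high: number, volume: number
): Viseme | null {
  if (volume < 0.05) return null;                           // silence: close mouth
  if (high > 0.35 && mid > 0.30 && low < 0.20) return 'ee'; // bright, spread
  if (low > 0.40 && high < 0.15) return 'oo';               // dark, rounded
  if (low > 0.30 && mid > 0.30) return 'ah';                // broad energy: open
  return 'oh';                                              // mid-round fallback
}
```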

---

## 🧪 Testing

### Test with different faces:
- Front-facing portraits work best
- Clear lighting, neutral expression
- Face should fill ~60-80% of frame
- JPG/PNG, reasonable resolution (500-2000px)

### Test with different audio:
- Male vs female voices (different frequency ranges)
- Fast vs slow speech (spring physics handles both)
- Whisper vs shout (volume scaling)
- Different languages (the viseme set covers mouth shapes common across languages)

### Debug mode:
```tsx
<AvatarEngine debug={true} />
```
Shows:
- Current viseme
- Volume level
- Face landmarks count
- Frame timing

---

## ๐Ÿ› Troubleshooting

### "No face detected"
- Ensure face is front-facing
- Check image quality
- Try different lighting
- Face must be clearly visible

### Mouth doesn't move
- Verify `audioStream` is connected
- Check `isPlaying` is true
- Enable debug mode to see viseme detection
- Confirm audio has sufficient volume

### Movement looks unnatural
- Decrease `intensity` (try 0.5 - 0.7)
- Adjust spring physics constants
- Check if face landmarks are accurate
- Try different deformation parameters

### Performance issues
- Reduce canvas size
- Lower mesh grid resolution (in `createMeshGeometry`)
- Disable debug overlay
- Use smaller source images

---

## 🔮 Future Enhancements

### Accuracy improvements:
- [ ] ML-based phoneme detection (Whisper, Wav2Vec)
- [ ] Co-articulation modeling (phoneme transitions)
- [ ] Prosody detection (emphasis, emotion)

### Visual improvements:
- [ ] Eye blinks synced to speech pauses
- [ ] Head motion (nods, turns)
- [ ] Facial expressions (smile, frown)
- [ ] Eyebrow movement

### Technical improvements:
- [ ] GPU acceleration for face detection
- [ ] Web Worker for audio analysis
- [ ] Cached deformation meshes
- [ ] Progressive face loading

---

## 📦 File Structure

```
/
├── FaceMeshAnalyzer.ts      # MediaPipe face detection
├── PhonemeDetector.ts       # Audio → viseme mapping
├── MeshDeformer.ts          # WebGL mesh warping
├── AvatarEngine.tsx         # Main React component
├── AvatarEngineDemo.tsx     # Example usage
└── README.md                # This file
```

---

## 🎯 Use Cases

- **Voice assistants**: Visualize AI responses
- **Video calls**: Animate static profile pics
- **Accessibility**: Visual speech cues for deaf and hard-of-hearing users
- **Gaming**: NPC dialogue animation
- **Education**: Language learning pronunciation
- **Entertainment**: Deepfake-lite (ethical use only)

---

## 📝 License

MIT - Do whatever you want with it.

---

## 🙏 Credits

Built with:
- [MediaPipe Face Mesh](https://google.github.io/mediapipe/solutions/face_mesh.html) by Google
- Web Audio API
- WebGL
- React

---

**Made with 🔥 for universal avatar animation**
