Generative AI solved pixel generation. We're solving video production.
The deterministic engine that gives AI agents the hands to edit video.
Built for the Gemini 3 Hackathon
The Bet: In 2 years, manual video editing will be obsolete for 90% of use cases. The bottleneck isn't AI models—it's the lack of infrastructure that lets agents actually edit. We're building that missing layer.
The Moat: Gemini Studio is the first video editor that understands your assets and clips semantically. The system organizes everything for you—no more renaming each file by hand or wasting hours in bins. Search your library in plain language (e.g. the drone shot over the water, the clip where the crowd cheers); the agent uses the same understanding to resolve which asset or clip you mean. Semantic understanding turns natural language into precise edits.
| Item | Link |
|---|---|
| Live demo | https://www.geminivideo.studio/ |
| Repository | https://github.com/youneslaaroussi/geministudio |
- The Problem
- The Solution
- Why Not a Plugin? A Ground-Up Redesign
- Why This Changes Everything
- Gemini 3 Pro: The Reasoning Layer
- Architecture
- How the Execution Layer Works
- Programmatic Motion Graphics: The Agent Writes Components, Not Templates
- Scene Compiler: esbuild for Millisecond-Fast Compilation
- Component plugins: data viz, maps, color, and noise
- Motion Canvas: The Perfect Rendering Layer for LLMs
- Autonomous Video Production: The Agent Can Watch Its Own Work
- Tech Stack
- Setup
- Credits & Resources
- License
Veo solved generation. But it didn't solve production.
A raw AI-generated clip is not a finished video. It has no narrative structure, no pacing, no intent. The bottleneck isn't the model—it's the lack of a rendering engine that can translate an agent's text-based intent into a frame-perfect video edit.
We're moving from "Tools for Editors" (e.g. Premiere Pro) to "Directors for Agents."
Gemini Studio is the infrastructure that gives AI agents hands.
We built the deterministic engine that allows an agent to:
- Ingest raw footage (screen recordings, generated clips, uploads)
- Organize & search — Assets and clips are indexed by content. No manual renaming; find anything by describing it (the wide shot, that B-roll of the product). The agent uses the same semantic layer for both assets and clips.
- Understand your intent—that title card, the drone shot, zoom in on the error—no file names, no hunting
- Execute the edit programmatically—frame-perfect, no human in the loop
This isn't a chatbot wrapper. The agent has real agency: it calls the renderer, manipulates the timeline, triggers Veo 3/Nano Banana Pro/Lyria/Chirp generation, and proactively notifies you when your video is ready. Gemini 3 Pro becomes the reasoning layer for the entire production stack.
The result: Video creation transforms from a manual craft into a scalable API call.
You can't bolt agentic capabilities onto software built for humans.
A Premiere Pro plugin is fundamentally limited: it automates UI clicks, parses menus, and simulates mouse movements. It's brittle, slow, and constrained by the editor's human-centric architecture. The agent is a guest in someone else's house, following rules it didn't write.
Gemini Studio is built from the ground up with agent-native tools and code.
| Aspect | Plugin Approach | Gemini Studio (Ground-Up) |
|---|---|---|
| Timeline Control | UI automation, brittle clicks | Programmatic API—deterministic, version-controlled |
| Asset Resolution | File paths, manual matching | Semantic understanding—resolve by meaning, not filename |
| Rendering | Export dialogs, progress bars | Headless renderer—API-driven, event-based |
| State Management | Screen scraping, guessing | Native state—the agent knows exactly what's on the timeline |
| Iteration | Can't watch its own work | Full loop—agent analyzes output and iterates autonomously |
| Branching | Impossible—single timeline | Git-style branches—agent edits on branches, you merge |
1. Semantic Asset Resolution
Plugins can't change how Premiere indexes assets. We built semantic understanding into the core—every upload is analyzed, every clip is searchable by content. The agent doesn't need file paths; it resolves "the drone shot" by understanding what's in your library.
2. Deterministic Rendering
Plugins trigger exports through UI dialogs. Our renderer is headless and API-driven—the agent calls render() with exact parameters, gets events on completion, and can iterate without human intervention.
3. Version-Controlled Timelines
Traditional editors have one timeline state. We built branching into the architecture—the agent edits on a branch, you review and merge. This requires ground-up state management that plugins can't provide.
4. Agent-Native Tools
Our 30+ tools aren't wrappers around UI actions—they're first-class operations designed for programmatic control. add_clip(), apply_transition(), search_assets(), createComponent() are deterministic functions the agent calls directly, not UI simulations.
5. Autonomous Iteration
The agent can watch its own renders, critique them, and adjust—a capability that requires tight integration between rendering, analysis, and timeline manipulation. Plugins can't close this loop.
The code is agent-native. Every component—from asset ingestion to final render—is designed for programmatic control. The agent isn't simulating a human editor; it's using tools built for it.
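To make "agent-native" concrete, here is a minimal sketch of what a tool like the `add_clip()` named in point 4 looks like to the agent: a typed, deterministic function over timeline state, not a UI macro. The real definitions live in `shared/tools/manifest.json` and `langgraph_server/tools/`; the names and fields below are illustrative assumptions.

```ts
// Hypothetical shape of an agent-callable timeline tool (illustrative only).
interface Clip {
  assetId: string;       // resolved semantically by the agent, not by file path
  track: number;
  startFrame: number;
  durationFrames: number;
}

interface Timeline {
  version: number;
  clips: Clip[];
}

interface ToolResult {
  ok: boolean;
  timelineVersion: number; // the agent always knows the exact post-edit state
}

function add_clip(timeline: Timeline, clip: Clip): ToolResult {
  // Deterministic state mutation: same inputs, same resulting timeline.
  // No UI automation, no screen scraping, no guessing.
  timeline.clips.push(clip);
  timeline.version += 1;
  return { ok: true, timelineVersion: timeline.version };
}

// The agent calls this directly with structured arguments it produced itself.
const timeline: Timeline = { version: 0, clips: [] };
add_clip(timeline, { assetId: 'drone-shot', track: 1, startFrame: 0, durationFrames: 240 });
```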
We had vibe coding. Now we have vibe editing. Describe the feeling you want. "Make it punchy." "Slow it down for drama." "Add energy to this section." "Give me a typewriter intro with a glitch on every 5th character." The agent understands vibes and translates them into concrete editing decisions—cuts, zooms, pacing, transitions—and even writes custom motion graphics from scratch when nothing in a template library would do.
Just like Cursor revolutionized coding by letting AI agents write alongside you, Gemini Studio lets AI agents edit alongside you. Same project. Same timeline. Human and agent, co-directing in real-time. And just like Cursor, the agent doesn't just autocomplete—it writes entire components, previews them, and iterates.
Your timeline is version-controlled. The cloud agent edits on a branch. You review the changes. Merge what you like, discard what you don't. Split timelines, experiment freely, sync seamlessly.
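The branch → edit → merge flow maps naturally onto CRDTs (the stack uses Automerge for collaborative timeline and branch sync; see Credits). A minimal sketch of the model, with a simplified timeline document standing in for the real schema:

```ts
import * as Automerge from '@automerge/automerge';

// Simplified timeline document; the real schema lives in the app.
type Timeline = { clips: { assetId: string; start: number; duration: number }[] };

// Main timeline, owned by the human editor.
let main = Automerge.from<Timeline>({ clips: [] });

// The cloud agent forks a branch and edits it independently.
let branch = Automerge.clone(main);
branch = Automerge.change(branch, 'agent: add intro clip', (doc) => {
  doc.clips.push({ assetId: 'drone-shot', start: 0, duration: 120 });
});

// The human keeps editing main in parallel...
main = Automerge.change(main, 'human: add outro', (doc) => {
  doc.clips.push({ assetId: 'logo-outro', start: 600, duration: 90 });
});

// ...then reviews and merges the agent's branch. CRDT semantics keep both edits.
main = Automerge.merge(main, branch);
console.log(main.clips.length); // 2
```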
| Feature | What It Enables |
|---|---|
| Semantic assets & clips | No manual renaming—the system organizes and indexes by content. Search your library in plain language; refer to assets and clips by what they are. The agent resolves "the drone shot," "that B-roll," etc. by meaning, not filename. |
| Vibe Editing | Intent-based editing ("make it cinematic," "add energy here") |
| Programmatic motion graphics | Agent writes freeform Motion Canvas components—no template ceiling. Typewriter text, animated charts, branded overlays, anything describable |
| Real-time Sync | Agent edits appear live in your timeline |
| Branching | Non-destructive experimentation |
| Merge/Split | Combine agent work with your own edits |
This isn't automation. This is collaboration between human directors and AI agents. Gemini Studio is the code-to-video layer—where vibe editing meets programmatic rendering.
Gemini 3 Pro isn't just integrated—it's the brain that makes agentic video possible. We leverage its state-of-the-art reasoning and native multimodal understanding to power every layer of the stack.
Agent Brain (LangGraph + Gemini 3 Pro)
Every interaction flows through Gemini 3 Pro. It reasons over project state, decides which tools to invoke, and orchestrates the entire editing pipeline. We use dynamic thinking_level to balance reasoning depth with response latency. Without Gemini 3 Pro's reasoning and tool use, there is no execution layer—only a traditional UI waiting for human input.
Multimodal Understanding (1M Token Context Window)
The agent doesn't just receive text—it sees and hears. Gemini 3 Pro can comprehend video, images, and audio natively through its 1 million token context window. We use the media_resolution parameter to optimize token usage while maintaining fidelity for scene detection, object recognition, and transcription.
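As a rough sketch of how those two knobs appear per request with the `@google/genai` SDK (the model name, field spellings, and values here follow current Gemini 3 documentation but should be treated as assumptions; the agent picks them dynamically per task):

```ts
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function planCut(fileUri: string) {
  // Heavier reasoning for edit planning; lower media resolution for long clips,
  // so the 1M-token window is spent on context rather than pixels.
  const response = await ai.models.generateContent({
    model: 'gemini-3-pro-preview',
    contents: [
      { fileData: { fileUri, mimeType: 'video/mp4' } },
      { text: 'Find the moment the crowd cheers and plan a cut there.' },
    ],
    config: {
      thinkingConfig: { thinkingLevel: 'high' }, // deep reasoning for planning steps
      mediaResolution: 'MEDIA_RESOLUTION_LOW',   // fewer tokens per video frame
    },
  });
  return response.text;
}
```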
Asset & Clip Intelligence Pipeline
Every uploaded asset and every clip goes through Gemini 3 Pro's multimodal analysis:
- Scene Detection — Automatic boundary identification using native video understanding
- Object Recognition — Context-aware detection throughout the video
- Speech Transcription — Full audio-to-text with word-level timestamps
- Semantic Understanding — High-level analysis ("what's happening here?")
The system organizes your library by content—no renaming files by hand. Search assets and clips in plain language; the agent uses the same indexing. It doesn't just know that you have a video—it knows what's in it, frame by frame. This is the moat: you say put the title card over the drone shot and the agent resolves which image and which clip by meaning, not by filename. No other editor does this.
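Concretely, each analyzed asset ends up as a searchable record along these lines. This is a simplified, hypothetical shape for illustration; the real schema lives in the asset-service and Firestore, and real resolution is semantic rather than substring matching.

```ts
// Hypothetical shape of an indexed asset record (simplified for illustration).
interface SceneSegment {
  startSec: number;
  endSec: number;
  description: string;            // "drone shot over the water at sunset"
  objects: string[];              // ["drone", "ocean", "boat"]
}

interface TranscriptWord {
  word: string;
  startSec: number;
  endSec: number;
}

interface IndexedAsset {
  assetId: string;
  summary: string;                // high-level "what's happening here?"
  scenes: SceneSegment[];
  transcript: TranscriptWord[];
}

// "Put the title card over the drone shot" resolves against this index,
// not against a filename. Substring match stands in for semantic search here.
function findScene(assets: IndexedAsset[], query: string): SceneSegment | undefined {
  for (const asset of assets) {
    const hit = asset.scenes.find((s) => s.description.includes(query));
    if (hit) return hit;
  }
  return undefined;
}
```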
Generative Pipeline (Veo 3, Nano Banana Pro, Lyria, Chirp)
The agent doesn't just edit existing footage—it creates. Need b-roll? Veo 3. Need a thumbnail? Nano Banana Pro. Need background music? Lyria. Need narration? Chirp TTS. These aren't add-ons; they're first-class tools the agent invokes autonomously based on narrative intent.
The Stack:
| Layer | Role |
|---|---|
| Gemini 3 Pro | Reasoning + tool orchestration + multimodal understanding |
| Files API | Media upload and processing |
| Veo 3 / Nano Banana Pro / Lyria / Chirp | Generative media creation |
| Motion Canvas | Deterministic frame-perfect rendering |
This is the full loop: ingest → perceive → reason → generate → render.
| Component | Tech | Port (default) | README |
|---|---|---|---|
| app | Next.js | 3000 | app/README.md |
| langgraph_server | FastAPI, LangGraph, Gemini | 8000 | langgraph_server/README.md |
| Telegram agent | Same LangGraph server, webhook | — | langgraph_server/README.md |
| asset-service | FastAPI, GCS, Firestore | 8081 | asset-service/README.md |
| renderer | Express, BullMQ, Puppeteer, FFmpeg | 4000 | renderer/README.md |
| scene | Motion Canvas, Vite | (build only) | — |
| scene-compiler | esbuild (default), optional Vite | 4001 | See Scene Compiler |
| video-effects-service | FastAPI, Replicate | — | video-effects-service/README.md |
| billing-service | NestJS, Firebase | — | billing-service/README.md |
1. Agent Receives Intent — User speaks naturally (web or Telegram). The Gemini 3 Pro agent parses the request and plans the execution.
2. Tools Execute Autonomously — The agent invokes 30+ tools: timeline manipulation, asset search, Veo generation, image creation, TTS, and custom component creation. Each tool is a deterministic operation the agent controls.
3. Renderer Produces Output — Motion Canvas renders the final video headlessly—pixel-perfect, production-ready. Pub/Sub events notify the agent on completion.
4. Agent Closes the Loop — "Your video is ready." The agent proactively informs the user. No polling. No waiting. Full autonomy.
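The "no polling" part of step 4 is an event subscription: the renderer publishes a completion event and the agent reacts. A minimal sketch with the Google Cloud Pub/Sub client; the subscription name and message fields are illustrative, not the exact names in the repo.

```ts
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();

// Illustrative subscription name; the real topic/subscription names live in the
// renderer and langgraph_server configuration.
const subscription = pubsub.subscription('render-completions');

subscription.on('message', (message) => {
  const event = JSON.parse(message.data.toString()) as {
    jobId: string;
    outputUrl: string;
    status: 'completed' | 'failed';
  };

  if (event.status === 'completed') {
    // The agent closes the loop: notify the user proactively (web or Telegram).
    notifyUser(`Your video is ready: ${event.outputUrl}`);
  }
  message.ack();
});

// Stand-in for the agent's outbound notification channel.
function notifyUser(text: string) {
  console.log(text);
}
```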
Motion Canvas is the secret sauce that makes agentic video editing possible.
We chose Motion Canvas as our rendering engine because it's built for code-first animation—exactly what LLMs excel at. Unlike traditional video editors that require UI automation, Motion Canvas uses React-like TypeScript code that agents can generate naturally.
Code-First Architecture
Motion Canvas animations are written as TypeScript generator functions. The agent doesn't simulate clicks or drags—it writes code:
```tsx
import {makeScene2D, Circle} from '@motion-canvas/2d';
import {createRef} from '@motion-canvas/core';

export default makeScene2D(function* (view) {
  const circle = createRef<Circle>();
  view.add(<Circle ref={circle} width={320} height={320} fill={'blue'} />);
  yield* circle().scale(2, 0.3);
  yield* circle().fill('green', 0.3);
});
```

This is exactly what LLMs are trained to do: generate code. The agent can compose complex animations, transitions, and effects by writing TypeScript—a task it's already excellent at.
Multimodal Capabilities
Motion Canvas integrates seamlessly with Gemini's multimodal understanding. The agent can:
- Analyze video frames to understand composition
- Generate code that matches visual intent
- Iterate by watching renders and adjusting code
- Compose complex scenes with multiple layers, effects, and transitions
Deterministic & Headless
Motion Canvas renders deterministically—same code, same output, every time. Combined with Puppeteer, we can render headlessly in the cloud. The agent calls render(), gets pixel-perfect output, and can iterate without human intervention.
Production-Ready
Motion Canvas isn't a prototype—it's battle-tested for production-quality animations. The agent generates code that produces broadcast-ready video, not experimental output.
The Result: Motion Canvas turns video editing from a visual craft into a coding problem—and coding is what LLMs do best. The agent writes TypeScript, Motion Canvas renders it, and you get professional video. This is why agentic video editing works: we're using the right tool for the job.
We had vibe coding. Now we have vibe editing. Gemini Studio is the code-to-video layer.
Most "AI video tools" give you a fixed library of templates and effects. Gemini Studio does something fundamentally different: the agent writes real Motion Canvas components from scratch—freeform TypeScript code with signals, generators, tweens, and the full animation runtime. There is no template ceiling. If you can describe it, the agent can build it.
When you say "make me a typewriter animation" or "add a progress ring that fills to 75%", the agent doesn't select from a menu. It writes a complete Motion Canvas component:
```tsx
import {Node, NodeProps, Txt, initial, signal} from '@motion-canvas/2d';
import {SimpleSignal, SignalValue, createSignal} from '@motion-canvas/core';

export interface TypewriterTextProps extends NodeProps {
  fullText?: SignalValue<string>;
  charDelay?: SignalValue<number>;
}

export class TypewriterText extends Node {
  @initial('Hello World')
  @signal()
  public declare readonly fullText: SimpleSignal<string, this>;

  @initial(0.05)
  @signal()
  public declare readonly charDelay: SimpleSignal<number, this>;

  // Internal progress signal driven by the reveal() generator below.
  private readonly progress = createSignal(0);

  public constructor(props?: TypewriterTextProps) {
    super({...props});
    this.add(
      <Txt
        text={() =>
          this.fullText().slice(
            0,
            Math.floor(this.progress() * this.fullText().length),
          )
        }
        fill={'#ffffff'}
        fontSize={48}
        fontFamily={'JetBrains Mono'}
      />,
    );
  }

  public *reveal(duration?: number) {
    yield* this.progress(1, duration ?? this.fullText().length * this.charDelay());
  }
}
```

This component is compiled on the fly, hot-loaded into the live preview, and rendered to final video—all without the user touching code. The agent can also iterate: watch the result, adjust timing, change easing, add effects, and recompile.
Agent writes TSX → Scene Compiler builds it → Preview renders live → Renderer exports final video
| Stage | What Happens |
|---|---|
| Create | Agent calls createComponent with full TSX code, input definitions, and a class name |
| Compile | Scene Compiler service compiles the component into the scene bundle (esbuild by default; see below). A barrel file is auto-generated—no manual registration needed |
| Preview | ScenePlayer detects the new component asset, recompiles, and renders it live in the browser at 30fps. Changes appear in real time |
| Control | Input definitions (inputDefs) surface as controls in the timeline inspector. Users tweak values (text, color, speed, size) without code |
| Render | The renderer compiles with the same component files and exports production-quality video via headless Puppeteer + FFmpeg |
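For context, the Create step boils down to a single structured tool call. The field names below are an illustrative guess at the payload; the authoritative schema lives in shared/tools/manifest.json.

```ts
// Illustrative createComponent payload (see shared/tools/manifest.json for the
// real schema). The agent supplies full TSX source, not a template id.
const payload = {
  className: 'TypewriterText',
  inputDefs: [
    { name: 'fullText', type: 'string', default: 'Hello World' }, // inspector-editable
    { name: 'charDelay', type: 'number', default: 0.05 },         // seconds per character
  ],
  tsx: `
    export class TypewriterText extends Node {
      // ...full component source, as in the example above...
    }
  `,
};
```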
The scene-compiler service turns Motion Canvas TypeScript (project + scenes + custom components) into a single JavaScript bundle that the preview and renderer load. It used to use Vite with the Motion Canvas plugin; we now use esbuild by default for roughly 25× faster cold compiles and instant cache hits.
| Scenario | Before (Vite) | After (esbuild) |
|---|---|---|
| Cold compile | ~3.5 s | ~130 ms |
| Same inputs (cache hit) | ~3.5 s | 0 ms |
| Custom component change | ~3.5 s | ~70 ms |
How it works: The compiler replicates the Motion Canvas Vite plugin behavior (e.g. ?scene wrappers, .meta and .glsl handling, virtual:settings.meta, custom component injection) in a custom esbuild plugin. Because esbuild strips URL query suffixes like ?scene before plugin callbacks see them, we read and rewrite project.ts at build time so scene imports use a custom __mc_scene__ suffix that the plugin can resolve. The result is the same bundle the Vite pipeline produced, with a small in-memory LRU cache (50 entries, 5 min TTL) so repeated compiles with identical inputs return immediately.
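A stripped-down sketch of the suffix trick: imports ending in `__mc_scene__` are resolved back to the underlying `.tsx` scene file and loaded through a custom namespace, standing in for the work the Vite plugin does for `?scene` imports. The real plugin also handles `.meta`, `.glsl`, and `virtual:settings.meta`; this is heavily simplified.

```ts
import * as esbuild from 'esbuild';
import { readFile } from 'node:fs/promises';
import * as path from 'node:path';

const sceneSuffixPlugin: esbuild.Plugin = {
  name: 'mc-scene-suffix',
  setup(build) {
    // Map "./scenes/intro__mc_scene__" back to the real scene file.
    build.onResolve({ filter: /__mc_scene__$/ }, (args) => ({
      path: path.resolve(args.resolveDir, args.path.replace(/__mc_scene__$/, '.tsx')),
      namespace: 'mc-scene',
    }));

    build.onLoad({ filter: /.*/, namespace: 'mc-scene' }, async (args) => ({
      // The real loader wraps the scene with metadata and registration code.
      contents: await readFile(args.path, 'utf8'),
      loader: 'tsx',
      resolveDir: path.dirname(args.path),
    }));
  },
};

// Used when compiling the rewritten project.ts into a single bundle.
await esbuild.build({
  entryPoints: ['project.ts'],
  bundle: true,
  outfile: 'scene-bundle.js',
  plugins: [sceneSuffixPlugin],
});
```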
Rollback: To use the original Vite-based compiler, set SCENE_COMPILER_ENGINE=vite in the scene-compiler environment. No code changes required.
- Port: 4001
- Config: `scene-compiler/` (e.g. `BASE_SCENE_DIR`, `SCENE_COMPILER_SHARED_SECRET`)
Component inputs (inputDefs) are static values set on the timeline—text content, colors, sizes, speeds. They don't animate. For temporal behavior (typewriter reveals, counters, progress bars, pulsing effects), the agent uses generator methods and signals inside the component. Inputs control what and how fast; generators and signals create the motion.
This is a deliberate design: it gives users simple controls while the agent handles the complexity of animation code.
| Capability | Example |
|---|---|
| Freeform motion graphics | Progress rings, stat counters, animated charts, lower thirds—anything describable |
| Branded overlays | Custom components with your colors, fonts, and layout |
| Data-driven visuals | Components that take numbers/text as inputs and animate them |
| Iterative refinement | Agent previews its own component, adjusts timing/easing, recompiles |
| Zero-template ceiling | No fixed library. The agent writes new components for every request |
The scene compiler bundles first-class plugins so the agent can generate components that go far beyond static shapes and text. Every plugin is compute-then-render: the library produces data (path strings, positions, colors, numbers); Motion Canvas renders it. No DOM, no charting framework—just the right primitives for AI-generated video.
| Plugin | What it does | What the agent can build |
|---|---|---|
| d3-geo | Geographic projections and SVG path strings from GeoJSON | Animated maps, region highlights, country/continent outlines |
| d3-shape | Arc, pie, line, and area path generators | Pie charts, donuts, line charts, area charts, animated data series |
| d3-scale | Map data domains to pixel ranges and color scales | Bar positions, axes, data-driven colors and sizes |
| d3-hierarchy | Tree, treemap, and pack layout algorithms | Tree diagrams, treemaps, sunbursts, nested visualizations |
| simplex-noise | Procedural 2D/3D/4D noise | Organic motion, flowing backgrounds, terrain, generative-style effects |
| chroma-js | Color scales, interpolation (LAB/LCH), lighten/darken/saturate | Palettes, data-driven colors, accessible contrast, brand-matched fills |
The in-app Monaco editor is wired with TypeScript types for all of these: when the agent (or you) edits a component, IntelliSense and type checking work for d3-shape, d3-scale, chroma, and the rest. Result: the agent can say "animate a bar chart from this data", "draw a map of Europe and highlight Germany", or "add a smooth noise-based background" and produce real, compilable code with full editor support.
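For example, a donut chart is just d3-shape computing arc path strings and Motion Canvas drawing them. This is a minimal sketch assuming the bundled d3-shape plugin and Motion Canvas's Path node; the data, colors, and sizes are placeholders.

```tsx
import { makeScene2D, Path } from '@motion-canvas/2d';
import { createRef, all } from '@motion-canvas/core';
import { arc, pie } from 'd3-shape';

export default makeScene2D(function* (view) {
  const data = [42, 28, 18, 12];                       // placeholder values
  const colors = ['#4285F4', '#34A853', '#FBBC05', '#EA4335'];

  // d3 computes geometry (angles + SVG path strings); Motion Canvas renders it.
  const slices = pie<number>().padAngle(0.02)(data);
  const toPath = arc<{ startAngle: number; endAngle: number; padAngle: number }>()
    .innerRadius(120)
    .outerRadius(200);

  const refs = slices.map(() => createRef<Path>());
  slices.forEach((slice, i) =>
    view.add(<Path ref={refs[i]} data={toPath(slice)!} fill={colors[i]} scale={0} />),
  );

  // Staggered pop-in animation for each slice.
  yield* all(...refs.map((ref, i) => ref().scale(1, 0.4 + i * 0.1)));
});
```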
This is what separates Gemini Studio from every other AI video tool. Others wrap a fixed API. We have a real runtime underneath that the AI programs against. The agent doesn't pick from a menu—it writes code, compiles it, previews it, and ships it.
This is the moat. Gemini Studio is the first platform where an AI agent can autonomously iterate on video edits without human intervention.
Traditional AI video tools: Generate → Done. No feedback loop. No iteration.
Gemini Studio: Generate → Watch → Critique → Adjust → Repeat → Deliver.
The agent has:
- Eyes (Gemini multimodal can analyze video content)
- Hands (30+ tools for timeline manipulation, component creation, and media generation)
- Creativity (writes custom motion graphics from scratch—no template ceiling)
- Judgment (can evaluate pacing, cuts, transitions, and component design)
- Memory (maintains context across iterations)
This is the difference between a tool that produces output and an agent that produces quality output.
Gemini Studio includes Live API integration for real-time voice conversations with the AI agent. The agent can execute tools, manipulate your timeline, and even "see" previews of your work—all through natural voice commands.
Current: When you ask the agent to watch your video, it extracts key frames using Mediabunny and analyzes them to understand the composition.
Future Vision: The Live API will continuously stream what you see—the same preview panel, at 5x playback speed—directly to the agent. This creates a true "pair editing" experience where the AI watches alongside you in real-time, ready to act on voice commands like "that cut was too early" or "add a transition here". The agent becomes a co-director who sees exactly what you see.
The agent intelligently chooses render settings based on intent:
| Mode | Settings | Use Case |
|---|---|---|
| Preview | `quality='low'`, `fps=15`, `range=[start,end]` | Fast iteration, reviewing segments |
| Draft | `quality='web'`, full timeline | Near-final review |
| Production | `quality='studio'`, full timeline | Final delivery |
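In code terms, mode selection is a mapping from intent to renderer parameters. The sketch below is hypothetical and simply mirrors the table above; the real render tool and its field names live in the renderer and the tool manifest.

```ts
// Hypothetical render-settings picker built from the table above.
type RenderMode = 'preview' | 'draft' | 'production';

interface RenderSettings {
  quality: 'low' | 'web' | 'studio';
  fps?: number;
  range?: [startSec: number, endSec: number]; // omitted = full timeline
}

function settingsFor(mode: RenderMode, range?: [number, number]): RenderSettings {
  switch (mode) {
    case 'preview':
      return { quality: 'low', fps: 15, range }; // fast iteration on a segment
    case 'draft':
      return { quality: 'web' };                 // near-final review, full timeline
    case 'production':
      return { quality: 'studio' };              // final delivery
  }
}
```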
| Category | Technology | Purpose |
|---|---|---|
| Frontend | Next.js, React | Web app, timeline editor, chat UI |
| Agent | LangGraph, Gemini | Conversational agent, tools |
| Render | Motion Canvas, Puppeteer | Headless video composition |
| Queue | BullMQ, Redis | Render job queue |
| Backend | FastAPI (Python) | LangGraph server, asset service, video-effects-service |
| Storage | GCS, Firestore | Assets, metadata, projects |
| Events | Google Cloud Pub/Sub | Render completion, pipeline events |
| Auth | Firebase | Auth, projects, chat sessions |
| Monorepo | pnpm workspaces | app, scene, renderer, shared |
The codebase is a pnpm monorepo with TypeScript (app, scene, renderer) and Python (langgraph_server, asset-service, video-effects-service). The LangGraph server and asset service include tests; the agent and tools are typed and documented for maintainability.
- Node.js 20+, pnpm 9 (`corepack enable pnpm`)
- Python 3.11+ (e.g. `uv` or `pip`)
- Redis (for the renderer queue)
- Google Cloud – GCS, optional Pub/Sub, Firebase
- Chrome or Chromium (for the renderer)
```bash
git clone https://github.com/youneslaaroussi/geministudio
cd geministudio
pnpm install
pnpm --filter @gemini-studio/scene run build
pnpm --filter @gemini-studio/renderer run build:headless
```

Copy the example env file for each service you run; set API keys and URLs. Details are in each service’s README.
| Service | Config |
|---|---|
| App | app/env.template → app/.env.local |
| LangGraph | langgraph_server/.env.example → langgraph_server/.env |
| Renderer | REDIS_URL (and optional Pub/Sub) in renderer/ |
| Asset service | asset-service/.env.example → asset-service/.env |
Option A: All services at once (recommended)
Requires Overmind (`brew install overmind`). Starts all 6 services in a unified TUI with per-service panes, logs, and controls.
```bash
pnpm dev
```

| Command | Description |
|---|---|
| `pnpm dev` | Start all services (Overmind TUI) |
| `pnpm dev:connect` | Connect to the Overmind session |
| `pnpm dev:restart` | Restart all or specific services |
| `pnpm dev:stop` | Stop all services |
| `pnpm dev:status` | Show service status |
In the Overmind TUI: press a service's key (e.g. 1 for app, 2 for renderer) to focus its pane; q to quit.
Option B: Without Overmind
If Overmind isn't installed, use the simple concurrent runner:
```bash
pnpm dev:simple
```

Option C: Separate terminals
1. Start Redis.
2. In separate terminals, start each service:

   ```bash
   # Terminal 1 – Renderer
   pnpm --filter @gemini-studio/renderer dev

   # Terminal 2 – LangGraph
   cd langgraph_server && uv run uvicorn langgraph_server.main:app --reload --port 8000

   # Terminal 3 – Asset service
   cd asset-service && uv run python -m asset_service

   # Terminal 4 – App
   pnpm --filter app dev
   ```

   Optionally add billing-service (`cd billing-service && pnpm start:dev`) and video-effects-service (`cd video-effects-service && uv run python -m video_effects_service`).
3. Open http://localhost:3000. If the LangGraph server is elsewhere, set `NEXT_PUBLIC_LANGGRAPH_URL` in the app env.
See deploy/README.md for full instructions. Key step: CI/CD does not copy service account files — you must manually provision them on the VM:
```bash
# Copy service accounts to VM (one-time setup)
gcloud compute scp secrets/google-service-account.json gemini-studio:/tmp/ --zone=us-central1-a
gcloud compute scp secrets/firebase-service-account.json gemini-studio:/tmp/ --zone=us-central1-a
gcloud compute ssh gemini-studio --zone=us-central1-a --command='
  sudo mv /tmp/google-service-account.json /opt/gemini-studio/deploy/secrets/
  sudo mv /tmp/firebase-service-account.json /opt/gemini-studio/deploy/secrets/
  sudo chmod 644 /opt/gemini-studio/deploy/secrets/*.json
'
```

```
GeminiStudio/
├── app/                      # Next.js app (editor, chat, assets UI)
├── ai-sdk/                   # Patched Vercel AI SDK (file-url in tool results for Gemini); see Credits
├── scene/                    # Motion Canvas project (Vite)
├── scene-compiler/           # On-demand scene compilation (esbuild default, optional Vite; Express)
├── renderer/                 # Render service (Express, BullMQ, headless bundle)
├── langgraph_server/         # LangGraph agent (FastAPI, Gemini 3, tools)
├── asset-service/            # Asset upload & pipeline (Gemini analysis, GCS, Firestore)
├── video-effects-service/    # Video effects (FastAPI, Replicate)
├── billing-service/          # Credits & billing (NestJS)
├── shared/                   # Shared tool manifest (shared/tools/manifest.json)
├── deploy/                   # Terraform, Caddy, docker-compose
├── package.json              # Root pnpm workspace
├── pnpm-workspace.yaml
└── README.md                 # This file
```
Key areas:
| Area | Path |
|---|---|
| Agent & tools | langgraph_server/agent.py, langgraph_server/tools/ |
| AI SDK fork | ai-sdk/ (patched @ai-sdk/google for multimodal tool results; app uses file:../ai-sdk/packages/google) |
| Tool manifest | shared/tools/manifest.json |
| Scene compiler | scene-compiler/ (esbuild-based on-demand compilation with custom component injection; optional Vite via SCENE_COMPILER_ENGINE=vite) |
| Renderer | renderer/ |
| Scene | scene/ (Motion Canvas project, component registry, clip playback) |
| App | app/app/ |
Each service has its own README for setup and deployment.
Gemini Studio is built on top of incredible open-source projects and cloud services. We're grateful to the communities that made this possible.
- Motion Canvas — The rendering engine that makes code-first video editing possible. React-like TypeScript API perfect for LLM-generated animations. GitHub · Docs
- FFmpeg — Audio/video transcoding, merging, and encoding in the renderer. We use it via fluent-ffmpeg.
- Puppeteer — Headless Chrome for running Motion Canvas and exporting frame-perfect video in the cloud.
- Mediabunny — JavaScript library for reading, writing, and converting video and audio in the browser. Web-first media toolkit.
- Vercel AI SDK — React hooks and streaming for the chat UI. We use `ai`, `@ai-sdk/react`, and `@ai-sdk/google` for the frontend agent experience. Multimodal tool results: The upstream SDK did not support sending video, image, or audio from tool results to Gemini (tool-returned files were serialized as JSON text, so the model never saw the media). We implemented `file-url` handling in the Google provider so that when a tool returns a file (e.g. our `watchVideo`/`watchAsset` tools), the model receives it as real `fileData` and can see/hear the content. This is a significant contribution: it enables the “agent watches its own work” loop and any agent that returns media from tools. We ship a patched copy under `ai-sdk/` (Apache 2.0, see ai-sdk/LICENSE) and plan to submit this as a PR to vercel/ai.
- LangGraph — Agent orchestration and tool execution. Powers our conversational agent. GitHub
- Google Gemini 3 Pro — Reasoning, tool use, multimodal understanding (video, images, audio), and generative APIs (Veo, Imagen, Lyria, Chirp).
Used for background asset understanding and indexing (the agent’s live reasoning and multimodal understanding are powered by Gemini).
- Cloud Video Intelligence API — Background shot detection, label detection, and video understanding in the asset pipeline so the library is searchable.
- Cloud Speech-to-Text — Background transcription with word-level timestamps so assets are searchable and captions can be generated.
- Cloud Text-to-Speech — Narration and TTS (Chirp) integration.
- Cloud Storage — Asset and render output storage.
- Cloud Pub/Sub — Render completion and pipeline events.
- Firebase — Auth, Firestore (projects, metadata), and real-time sync.
- Algolia — Semantic and full-text search over the asset library.
- CloudConvert — Image and document conversion in the asset pipeline.
- Replicate — Video effects (e.g. background removal, chroma key) via the video-effects-service.
- Next.js — Web app and API routes.
- FastAPI — LangGraph server, asset-service, video-effects-service.
- BullMQ — Render job queue (Redis-backed).
- Automerge — CRDTs for collaborative timeline and branch sync.
Built by Younes Laaroussi
| | Link |
|---|---|
| Site | youneslaaroussi.ca |
| LinkedIn | linkedin.com/in/younes-laaroussi |
| 𝕏 | @younesistaken |
Elastic License 2.0 (ELv2) – See LICENSE.
You may use, copy, distribute, and make derivative works of the software. You may not offer it to third parties as a hosted or managed service (i.e. you cannot run “Gemini Studio as a service” for others). You must keep license and copyright notices intact and pass these terms on to anyone who receives the software from you.





