Generative AI solved pixel generation. We're solving video production.
The deterministic engine that gives AI agents the hands to edit video.
Built for the Gemini 3 Hackathon
The Bet: In 2 years, manual video editing will be obsolete for 90% of use cases. The bottleneck isn't AI models—it's the lack of infrastructure that lets agents actually edit. We're building that missing layer.
The Moat: Gemini Studio is the first video editor that understands your assets and clips semantically. The system organizes everything for you—no more renaming each file by hand or wasting hours in bins. Search your library in plain language (e.g. the drone shot over the water, the clip where the crowd cheers); the agent uses the same understanding to resolve which asset or clip you mean. Semantic understanding turns natural language into precise edits.
| Item | Link |
|---|---|
| Live demo | https://www.geminivideo.studio/ |
| Repository | https://github.com/youneslaaroussi/geministudio |
- The Problem
- The Solution
- Why Not a Plugin? A Ground-Up Redesign
- Why This Changes Everything
- Gemini 3 Pro: The Reasoning Layer
- Architecture
- How the Execution Layer Works
- Programmatic Motion Graphics: The Agent Writes Components, Not Templates
- Scene Compiler: esbuild for Millisecond-Fast Compilation
- Component plugins: data viz, maps, color, and noise
- Motion Canvas: The Perfect Rendering Layer for LLMs
- Autonomous Video Production: The Agent Can Watch Its Own Work
- Tech Stack
- Setup
- Credits & Resources
- License
Veo solved generation. But it didn't solve production.
A raw AI-generated clip is not a finished video. It has no narrative structure, no pacing, no intent. The bottleneck isn't the model—it's the lack of a rendering engine that can translate an agent's text-based intent into a frame-perfect video edit.
We're moving from "Tools for Editors" (e.g. Premiere Pro) to "Directors for Agents."
Gemini Studio is the infrastructure that gives AI agents hands.
We built the deterministic engine that allows an agent to:
- Ingest raw footage (screen recordings, generated clips, uploads)
- Organize & search — Assets and clips are indexed by content. No manual renaming; find anything by describing it (the wide shot, that B-roll of the product). The agent uses the same semantic layer for both assets and clips.
- Understand your intent—that title card, the drone shot, zoom in on the error—no file names, no hunting
- Execute the edit programmatically—frame-perfect, no human in the loop
This isn't a chatbot wrapper. The agent has real agency: it calls the renderer, manipulates the timeline, triggers Veo 3/Nano Banana Pro/Lyria/Chirp generation, and proactively notifies you when your video is ready. Gemini 3 Pro becomes the reasoning layer for the entire production stack.
The result: Video creation transforms from a manual craft into a scalable API call.
You can't bolt agentic capabilities onto software built for humans.
A Premiere Pro plugin is fundamentally limited: it automates UI clicks, parses menus, and simulates mouse movements. It's brittle, slow, and constrained by the editor's human-centric architecture. The agent is a guest in someone else's house, following rules it didn't write.
Gemini Studio is built from the ground up with agent-native tools and code.
| Aspect | Plugin Approach | Gemini Studio (Ground-Up) |
|---|---|---|
| Timeline Control | UI automation, brittle clicks | Programmatic API—deterministic, version-controlled |
| Asset Resolution | File paths, manual matching | Semantic understanding—resolve by meaning, not filename |
| Rendering | Export dialogs, progress bars | Headless renderer—API-driven, event-based |
| State Management | Screen scraping, guessing | Native state—the agent knows exactly what's on the timeline |
| Iteration | Can't watch its own work | Full loop—agent analyzes output and iterates autonomously |
| Branching | Impossible—single timeline | Git-style branches—agent edits on branches, you merge |
1. Semantic Asset Resolution
Plugins can't change how Premiere indexes assets. We built semantic understanding into the core—every upload is analyzed, every clip is searchable by content. The agent doesn't need file paths; it resolves "the drone shot" by understanding what's in your library.
2. Deterministic Rendering
Plugins trigger exports through UI dialogs. Our renderer is headless and API-driven—the agent calls render() with exact parameters, gets events on completion, and can iterate without human intervention.
3. Version-Controlled Timelines
Traditional editors have one timeline state. We built branching into the architecture—the agent edits on a branch, you review and merge. This requires ground-up state management that plugins can't provide.
4. Agent-Native Tools
Our 30+ tools aren't wrappers around UI actions—they're first-class operations designed for programmatic control. add_clip(), apply_transition(), search_assets(), createComponent() are deterministic functions the agent calls directly, not UI simulations.
5. Autonomous Iteration
The agent can watch its own renders, critique them, and adjust—a capability that requires tight integration between rendering, analysis, and timeline manipulation. Plugins can't close this loop.
The code is agent-native. Every component—from asset ingestion to final render—is designed for programmatic control. The agent isn't simulating a human editor; it's using tools built for it.
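To make "agent-native" concrete, here is a minimal sketch of what a tool like the `add_clip()` named in point 4 looks like to the agent: a typed, deterministic function over timeline state, not a UI macro. The real definitions live in `shared/tools/manifest.json` and `langgraph_server/tools/`; the names and fields below are illustrative assumptions.

```ts
// Hypothetical shape of an agent-callable timeline tool (illustrative only).
interface Clip {
  assetId: string;       // resolved semantically by the agent, not by file path
  track: number;
  startFrame: number;
  durationFrames: number;
}

interface Timeline {
  version: number;
  clips: Clip[];
}

interface ToolResult {
  ok: boolean;
  timelineVersion: number; // the agent always knows the exact post-edit state
}

function add_clip(timeline: Timeline, clip: Clip): ToolResult {
  // Deterministic state mutation: same inputs, same resulting timeline.
  // No UI automation, no screen scraping, no guessing.
  timeline.clips.push(clip);
  timeline.version += 1;
  return { ok: true, timelineVersion: timeline.version };
}

// The agent calls this directly with structured arguments it produced itself.
const timeline: Timeline = { version: 0, clips: [] };
add_clip(timeline, { assetId: 'drone-shot', track: 1, startFrame: 0, durationFrames: 240 });
```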
We had vibe coding. Now we have vibe editing. Describe the feeling you want. "Make it punchy." "Slow it down for drama." "Add energy to this section." "Give me a typewriter intro with a glitch on every 5th character." The agent understands vibes and translates them into concrete editing decisions—cuts, zooms, pacing, transitions—and even writes custom motion graphics from scratch when nothing in a template library would do.
Just like Cursor revolutionized coding by letting AI agents write alongside you, Gemini Studio lets AI agents edit alongside you. Same project. Same timeline. Human and agent, co-directing in real-time. And just like Cursor, the agent doesn't just autocomplete—it writes entire components, previews them, and iterates.
Your timeline is version-controlled. The cloud agent edits on a branch. You review the changes. Merge what you like, discard what you don't. Split timelines, experiment freely, sync seamlessly.
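The branch → edit → merge flow maps naturally onto CRDTs (the stack uses Automerge for collaborative timeline and branch sync; see Credits). A minimal sketch of the model, with a simplified timeline document standing in for the real schema:

```ts
import * as Automerge from '@automerge/automerge';

// Simplified timeline document; the real schema lives in the app.
type Timeline = { clips: { assetId: string; start: number; duration: number }[] };

// Main timeline, owned by the human editor.
let main = Automerge.from<Timeline>({ clips: [] });

// The cloud agent forks a branch and edits it independently.
let branch = Automerge.clone(main);
branch = Automerge.change(branch, 'agent: add intro clip', (doc) => {
  doc.clips.push({ assetId: 'drone-shot', start: 0, duration: 120 });
});

// The human keeps editing main in parallel...
main = Automerge.change(main, 'human: add outro', (doc) => {
  doc.clips.push({ assetId: 'logo-outro', start: 600, duration: 90 });
});

// ...then reviews and merges the agent's branch. CRDT semantics keep both edits.
main = Automerge.merge(main, branch);
console.log(main.clips.length); // 2
```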
| Feature | What It Enables |
|---|---|
| Semantic assets & clips | No manual renaming—the system organizes and indexes by content. Search your library in plain language; refer to assets and clips by what they are. The agent resolves "the drone shot," "that B-roll," etc. by meaning, not filename. |
| Vibe Editing | Intent-based editing ("make it cinematic," "add energy here") |
| Programmatic motion graphics | Agent writes freeform Motion Canvas components—no template ceiling. Typewriter text, animated charts, branded overlays, anything describable |
| Real-time Sync | Agent edits appear live in your timeline |
| Branching | Non-destructive experimentation |
| Merge/Split | Combine agent work with your own edits |
This isn't automation. This is collaboration between human directors and AI agents. Gemini Studio is the code-to-video layer—where vibe editing meets programmatic rendering.
Gemini 3 Pro isn't just integrated—it's the brain that makes agentic video possible. We leverage its state-of-the-art reasoning and native multimodal understanding to power every layer of the stack.
Agent Brain (LangGraph + Gemini 3 Pro)
Every interaction flows through Gemini 3 Pro. It reasons over project state, decides which tools to invoke, and orchestrates the entire editing pipeline. We use dynamic thinking_level to balance reasoning depth with response latency. Without Gemini 3 Pro's reasoning and tool use, there is no execution layer—only a traditional UI waiting for human input.
Multimodal Understanding (1M Token Context Window)
The agent doesn't just receive text—it sees and hears. Gemini 3 Pro can comprehend video, images, and audio natively through its 1 million token context window. We use the media_resolution parameter to optimize token usage while maintaining fidelity for scene detection, object recognition, and transcription.
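As a rough sketch of how those two knobs appear per request with the `@google/genai` SDK (the model name, field spellings, and values here follow current Gemini 3 documentation but should be treated as assumptions; the agent picks them dynamically per task):

```ts
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function planCut(fileUri: string) {
  // Heavier reasoning for edit planning; lower media resolution for long clips,
  // so the 1M-token window is spent on context rather than pixels.
  const response = await ai.models.generateContent({
    model: 'gemini-3-pro-preview',
    contents: [
      { fileData: { fileUri, mimeType: 'video/mp4' } },
      { text: 'Find the moment the crowd cheers and plan a cut there.' },
    ],
    config: {
      thinkingConfig: { thinkingLevel: 'high' }, // deep reasoning for planning steps
      mediaResolution: 'MEDIA_RESOLUTION_LOW',   // fewer tokens per video frame
    },
  });
  return response.text;
}
```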
Asset & Clip Intelligence Pipeline
Every uploaded asset and every clip goes through Gemini 3 Pro's multimodal analysis:
- Scene Detection — Automatic boundary identification using native video understanding
- Object Recognition — Context-aware detection throughout the video
- Speech Transcription — Full audio-to-text with word-level timestamps
- Semantic Understanding — High-level analysis ("what's happening here?")
The system organizes your library by content—no renaming files by hand. Search assets and clips in plain language; the agent uses the same indexing. It doesn't just know that you have a video—it knows what's in it, frame by frame. This is the moat: you say put the title card over the drone shot and the agent resolves which image and which clip by meaning, not by filename. No other editor does this.
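Concretely, each analyzed asset ends up as a searchable record along these lines. This is a simplified, hypothetical shape for illustration; the real schema lives in the asset-service and Firestore, and real resolution is semantic rather than substring matching.

```ts
// Hypothetical shape of an indexed asset record (simplified for illustration).
interface SceneSegment {
  startSec: number;
  endSec: number;
  description: string;            // "drone shot over the water at sunset"
  objects: string[];              // ["drone", "ocean", "boat"]
}

interface TranscriptWord {
  word: string;
  startSec: number;
  endSec: number;
}

interface IndexedAsset {
  assetId: string;
  summary: string;                // high-level "what's happening here?"
  scenes: SceneSegment[];
  transcript: TranscriptWord[];
}

// "Put the title card over the drone shot" resolves against this index,
// not against a filename. Substring match stands in for semantic search here.
function findScene(assets: IndexedAsset[], query: string): SceneSegment | undefined {
  for (const asset of assets) {
    const hit = asset.scenes.find((s) => s.description.includes(query));
    if (hit) return hit;
  }
  return undefined;
}
```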
Generative Pipeline (Veo 3, Nano Banana Pro, Lyria, Chirp)
The agent doesn't just edit existing footage—it creates. Need b-roll? Veo 3. Need a thumbnail? Nano Banana Pro. Need background music? Lyria. Need narration? Chirp TTS. These aren't add-ons; they're first-class tools the agent invokes autonomously based on narrative intent.
The Stack:
| Layer | Role |
|---|---|
| Gemini 3 Pro | Reasoning + tool orchestration + multimodal understanding |
| Files API | Media upload and processing |
| Veo 3 / Nano Banana Pro / Lyria / Chirp | Generative media creation |
| Motion Canvas | Deterministic frame-perfect rendering |
This is the full loop: ingest → perceive → reason → generate → render.
| Component | Tech | Port (default) | README |
|---|---|---|---|
| app | Next.js | 3000 | app/README.md |
| langgraph_server | FastAPI, LangGraph, Gemini | 8000 | langgraph_server/README.md |
| Telegram agent | Same LangGraph server, webhook | — | langgraph_server/README.md |
| asset-service | FastAPI, GCS, Firestore | 8081 | asset-service/README.md |
| renderer | Express, BullMQ, Puppeteer, FFmpeg | 4000 | renderer/README.md |
| scene | Motion Canvas, Vite | (build only) | — |
| scene-compiler | esbuild (default), optional Vite | 4001 | See Scene Compiler |
| video-effects-service | FastAPI, Replicate | — | video-effects-service/README.md |
| billing-service | NestJS, Firebase | — | billing-service/README.md |
1. Agent Receives Intent — User speaks naturally (web or Telegram). The Gemini 3 Pro agent parses the request and plans the execution.
2. Tools Execute Autonomously — The agent invokes 30+ tools: timeline manipulation, asset search, Veo generation, image creation, TTS, and custom component creation. Each tool is a deterministic operation the agent controls.
3. Renderer Produces Output — Motion Canvas renders the final video headlessly—pixel-perfect, production-ready. Pub/Sub events notify the agent on completion.
4. Agent Closes the Loop — "Your video is ready." The agent proactively informs the user. No polling. No waiting. Full autonomy.
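The "no polling" part of step 4 is an event subscription: the renderer publishes a completion event and the agent reacts. A minimal sketch with the Google Cloud Pub/Sub client; the subscription name and message fields are illustrative, not the exact names in the repo.

```ts
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();

// Illustrative subscription name; the real topic/subscription names live in the
// renderer and langgraph_server configuration.
const subscription = pubsub.subscription('render-completions');

subscription.on('message', (message) => {
  const event = JSON.parse(message.data.toString()) as {
    jobId: string;
    outputUrl: string;
    status: 'completed' | 'failed';
  };

  if (event.status === 'completed') {
    // The agent closes the loop: notify the user proactively (web or Telegram).
    notifyUser(`Your video is ready: ${event.outputUrl}`);
  }
  message.ack();
});

// Stand-in for the agent's outbound notification channel.
function notifyUser(text: string) {
  console.log(text);
}
```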
Motion Canvas is the secret sauce that makes agentic video editing possible.
We chose Motion Canvas as our rendering engine because it's built for code-first animation—exactly what LLMs excel at. Unlike traditional video editors that require UI automation, Motion Canvas uses React-like TypeScript code that agents can generate naturally.
Code-First Architecture
Motion Canvas animations are written as TypeScript generator functions. The agent doesn't simulate clicks or drags—it writes code:
```tsx
import {makeScene2D, Circle} from '@motion-canvas/2d';
import {createRef} from '@motion-canvas/core';

export default makeScene2D(function* (view) {
  const circle = createRef<Circle>();
  view.add(<Circle ref={circle} width={320} height={320} fill={'blue'} />);
  yield* circle().scale(2, 0.3);
  yield* circle().fill('green', 0.3);
});
```

This is exactly what LLMs are trained to do: generate code. The agent can compose complex animations, transitions, and effects by writing TypeScript—a task it's already excellent at.
Multimodal Capabilities
Motion Canvas integrates seamlessly with Gemini's multimodal understanding. The agent can:
- Analyze video frames to understand composition
- Generate code that matches visual intent
- Iterate by watching renders and adjusting code
- Compose complex scenes with multiple layers, effects, and transitions
Deterministic & Headless
Motion Canvas renders deterministically—same code, same output, every time. Combined with Puppeteer, we can render headlessly in the cloud. The agent calls render(), gets pixel-perfect output, and can iterate without human intervention.
Production-Ready
Motion Canvas isn't a prototype—it's battle-tested for production-quality animations. The agent generates code that produces broadcast-ready video, not experimental output.
The Result: Motion Canvas turns video editing from a visual craft into a coding problem—and coding is what LLMs do best. The agent writes TypeScript, Motion Canvas renders it, and you get professional video. This is why agentic video editing works: we're using the right tool for the job.
We had vibe coding. Now we have vibe editing. Gemini Studio is the code-to-video layer.
Most "AI video tools" give you a fixed library of templates and effects. Gemini Studio does something fundamentally different: the agent writes real Motion Canvas components from scratch—freeform TypeScript code with signals, generators, tweens, and the full animation runtime. There is no template ceiling. If you can describe it, the agent can build it.
When you say "make me a typewriter animation" or "add a progress ring that fills to 75%", the agent doesn't select from a menu. It writes a complete Motion Canvas component:
```tsx
import {Node, NodeProps, Txt, initial, signal} from '@motion-canvas/2d';
import {SimpleSignal, SignalValue, createSignal} from '@motion-canvas/core';

export interface TypewriterTextProps extends NodeProps {
  fullText?: SignalValue<string>;
  charDelay?: SignalValue<number>;
}

export class TypewriterText extends Node {
  @initial('Hello World')
  @signal()
  public declare readonly fullText: SimpleSignal<string, this>;

  @initial(0.05)
  @signal()
  public declare readonly charDelay: SimpleSignal<number, this>;

  // Internal progress signal driven by the reveal() generator below.
  private readonly progress = createSignal(0);

  public constructor(props?: TypewriterTextProps) {
    super({...props});
    this.add(
      <Txt
        text={() =>
          this.fullText().slice(
            0,
            Math.floor(this.progress() * this.fullText().length),
          )
        }
        fill={'#ffffff'}
        fontSize={48}
        fontFamily={'JetBrains Mono'}
      />,
    );
  }

  public *reveal(duration?: number) {
    yield* this.progress(1, duration ?? this.fullText().length * this.charDelay());
  }
}
```

This component is compiled on the fly, hot-loaded into the live preview, and rendered to final video—all without the user touching code. The agent can also iterate: watch the result, adjust timing, change easing, add effects, and recompile.
Agent writes TSX → Scene Compiler builds it → Preview renders live → Renderer exports final video
| Stage | What Happens |
|---|---|
| Create | Agent calls createComponent with full TSX code, input definitions, and a class name |
| Compile | Scene Compiler service compiles the component into the scene bundle (esbuild by default; see below). A barrel file is auto-generated—no manual registration needed |
| Preview | ScenePlayer detects the new component asset, recompiles, and renders it live in the browser at 30fps. Changes appear in real time |
| Control | Input definitions (inputDefs) surface as controls in the timeline inspector. Users tweak values (text, color, speed, size) without code |
| Render | The renderer compiles with the same component files and exports production-quality video via headless Puppeteer + FFmpeg |
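For context, the Create step boils down to a single structured tool call. The field names below are an illustrative guess at the payload; the authoritative schema lives in shared/tools/manifest.json.

```ts
// Illustrative createComponent payload (see shared/tools/manifest.json for the
// real schema). The agent supplies full TSX source, not a template id.
const payload = {
  className: 'TypewriterText',
  inputDefs: [
    { name: 'fullText', type: 'string', default: 'Hello World' }, // inspector-editable
    { name: 'charDelay', type: 'number', default: 0.05 },         // seconds per character
  ],
  tsx: `
    export class TypewriterText extends Node {
      // ...full component source, as in the example above...
    }
  `,
};
```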
The scene-compiler service turns Motion Canvas TypeScript (project + scenes + custom components) into a single JavaScript bundle that the preview and renderer load. It used to use Vite with the Motion Canvas plugin; we now use esbuild by default for roughly 25× faster cold compiles and instant cache hits.
| Scenario | Before (Vite) | After (esbuild) |
|---|---|---|
| Cold compile | ~3.5 s | ~130 ms |
| Same inputs (cache hit) | ~3.5 s | 0 ms |
| Custom component change | ~3.5 s | ~70 ms |
How it works: The compiler replicates the Motion Canvas Vite plugin behavior (e.g. ?scene wrappers, .meta and .glsl handling, virtual:settings.meta, custom component injection) in a custom esbuild plugin. Because esbuild strips URL query suffixes like ?scene before plugin callbacks see them, we read and rewrite project.ts at build time so scene imports use a custom __mc_scene__ suffix that the plugin can resolve. The result is the same bundle the Vite pipeline produced, with a small in-memory LRU cache (50 entries, 5 min TTL) so repeated compiles with identical inputs return immediately.
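A stripped-down sketch of the suffix trick: imports ending in `__mc_scene__` are resolved back to the underlying `.tsx` scene file and loaded through a custom namespace, standing in for the work the Vite plugin does for `?scene` imports. The real plugin also handles `.meta`, `.glsl`, and `virtual:settings.meta`; this is heavily simplified.

```ts
import * as esbuild from 'esbuild';
import { readFile } from 'node:fs/promises';
import * as path from 'node:path';

const sceneSuffixPlugin: esbuild.Plugin = {
  name: 'mc-scene-suffix',
  setup(build) {
    // Map "./scenes/intro__mc_scene__" back to the real scene file.
    build.onResolve({ filter: /__mc_scene__$/ }, (args) => ({
      path: path.resolve(args.resolveDir, args.path.replace(/__mc_scene__$/, '.tsx')),
      namespace: 'mc-scene',
    }));

    build.onLoad({ filter: /.*/, namespace: 'mc-scene' }, async (args) => ({
      // The real loader wraps the scene with metadata and registration code.
      contents: await readFile(args.path, 'utf8'),
      loader: 'tsx',
      resolveDir: path.dirname(args.path),
    }));
  },
};

// Used when compiling the rewritten project.ts into a single bundle.
await esbuild.build({
  entryPoints: ['project.ts'],
  bundle: true,
  outfile: 'scene-bundle.js',
  plugins: [sceneSuffixPlugin],
});
```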
Rollback: To use the original Vite-based compiler, set SCENE_COMPILER_ENGINE=vite in the scene-compiler environment. No code changes required.
- Port: 4001
- Config: `scene-compiler/` (e.g. `BASE_SCENE_DIR`, `SCENE_COMPILER_SHARED_SECRET`)
Component inputs (inputDefs) are static values set on the timeline—text content, colors, sizes, speeds. They don't animate. For temporal behavior (typewriter reveals, counters, progress bars, pulsing effects), the agent uses generator methods and signals inside the component. Inputs control what and how fast; generators and signals create the motion.
This is a deliberate design: it gives users simple controls while the agent handles the complexity of animation code.
| Capability | Example |
|---|---|
| Freeform motion graphics | Progress rings, stat counters, animated charts, lower thirds—anything describable |
| Branded overlays | Custom components with your colors, fonts, and layout |
| Data-driven visuals | Components that take numbers/text as inputs and animate them |
| Iterative refinement | Agent previews its own component, adjusts timing/easing, recompiles |
| Zero-template ceiling | No fixed library. The agent writes new components for every request |
The scene compiler bundles first-class plugins so the agent can generate components that go far beyond static shapes and text. Every plugin is compute-then-render: the library produces data (path strings, positions, colors, numbers); Motion Canvas renders it. No DOM, no charting framework—just the right primitives for AI-generated video.
| Plugin | What it does | What the agent can build |
|---|---|---|
| d3-geo | Geographic projections and SVG path strings from GeoJSON | Animated maps, region highlights, country/continent outlines |
| d3-shape | Arc, pie, line, and area path generators | Pie charts, donuts, line charts, area charts, animated data series |
| d3-scale | Map data domains to pixel ranges and color scales | Bar positions, axes, data-driven colors and sizes |
| d3-hierarchy | Tree, treemap, and pack layout algorithms | Tree diagrams, treemaps, sunbursts, nested visualizations |
| simplex-noise | Procedural 2D/3D/4D noise | Organic motion, flowing backgrounds, terrain, generative-style effects |
| chroma-js | Color scales, interpolation (LAB/LCH), lighten/darken/saturate | Palettes, data-driven colors, accessible contrast, brand-matched fills |
The in-app Monaco editor is wired with TypeScript types for all of these: when the agent (or you) edits a component, IntelliSense and type checking work for d3-shape, d3-scale, chroma, and the rest. Result: the agent can say "animate a bar chart from this data", "draw a map of Europe and highlight Germany", or "add a smooth noise-based background" and produce real, compilable code with full editor support.
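For example, a donut chart is just d3-shape computing arc path strings and Motion Canvas drawing them. This is a minimal sketch assuming the bundled d3-shape plugin and Motion Canvas's Path node; the data, colors, and sizes are placeholders.

```tsx
import { makeScene2D, Path } from '@motion-canvas/2d';
import { createRef, all } from '@motion-canvas/core';
import { arc, pie } from 'd3-shape';

export default makeScene2D(function* (view) {
  const data = [42, 28, 18, 12];                       // placeholder values
  const colors = ['#4285F4', '#34A853', '#FBBC05', '#EA4335'];

  // d3 computes geometry (angles + SVG path strings); Motion Canvas renders it.
  const slices = pie<number>().padAngle(0.02)(data);
  const toPath = arc<{ startAngle: number; endAngle: number; padAngle: number }>()
    .innerRadius(120)
    .outerRadius(200);

  const refs = slices.map(() => createRef<Path>());
  slices.forEach((slice, i) =>
    view.add(<Path ref={refs[i]} data={toPath(slice)!} fill={colors[i]} scale={0} />),
  );

  // Staggered pop-in animation for each slice.
  yield* all(...refs.map((ref, i) => ref().scale(1, 0.4 + i * 0.1)));
});
```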
This is what separates Gemini Studio from every other AI video tool. Others wrap a fixed API. We have a real runtime underneath that the AI programs against. The agent doesn't pick from a menu—it writes code, compiles it, previews it, and ships it.
This is the moat. Gemini Studio is the first platform where an AI agent can autonomously iterate on video edits without human intervention.
Traditional AI video tools: Generate → Done. No feedback loop. No iteration.
Gemini Studio: Generate → Watch → Critique → Adjust → Repeat → Deliver.
The agent has:
- Eyes (Gemini multimodal can analyze video content)
- Hands (30+ tools for timeline manipulation, component creation, and media generation)
- Creativity (writes custom motion graphics from scratch—no template ceiling)
- Judgment (can evaluate pacing, cuts, transitions, and component design)
- Memory (maintains context across iterations)
This is the difference between a tool that produces output and an agent that produces quality output.
Gemini Studio includes Live API integration for real-time voice conversations with the AI agent. The agent can execute tools, manipulate your timeline, and even "see" previews of your work—all through natural voice commands.
Current: When you ask the agent to watch your video, it extracts key frames using Mediabunny and analyzes them to understand the composition.
Future Vision: The Live API will continuously stream what you see—the same preview panel, at 5x playback speed—directly to the agent. This creates a true "pair editing" experience where the AI watches alongside you in real-time, ready to act on voice commands like "that cut was too early" or "add a transition here". The agent becomes a co-director who sees exactly what you see.
The agent intelligently chooses render settings based on intent:
| Mode | Settings | Use Case |
|---|---|---|
| Preview | `quality='low'`, `fps=15`, `range=[start,end]` | Fast iteration, reviewing segments |
| Draft | `quality='web'`, full timeline | Near-final review |
| Production | `quality='studio'`, full timeline | Final delivery |
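In code terms, mode selection is a mapping from intent to renderer parameters. The sketch below is hypothetical and simply mirrors the table above; the real render tool and its field names live in the renderer and the tool manifest.

```ts
// Hypothetical render-settings picker built from the table above.
type RenderMode = 'preview' | 'draft' | 'production';

interface RenderSettings {
  quality: 'low' | 'web' | 'studio';
  fps?: number;
  range?: [startSec: number, endSec: number]; // omitted = full timeline
}

function settingsFor(mode: RenderMode, range?: [number, number]): RenderSettings {
  switch (mode) {
    case 'preview':
      return { quality: 'low', fps: 15, range }; // fast iteration on a segment
    case 'draft':
      return { quality: 'web' };                 // near-final review, full timeline
    case 'production':
      return { quality: 'studio' };              // final delivery
  }
}
```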
| Category | Technology | Purpose |
|---|---|---|
| Frontend | Next.js, React | Web app, timeline editor, chat UI |
| Agent | LangGraph, Gemini | Conversational agent, tools |
| Render | Motion Canvas, Puppeteer | Headless video composition |
| Queue | BullMQ, Redis | Render job queue |
| Backend | FastAPI (Python) | LangGraph server, asset service, video-effects-service |
| Storage | GCS, Firestore | Assets, metadata, projects |
| Events | Google Cloud Pub/Sub | Render completion, pipeline events |
| Auth | Firebase | Auth, projects, chat sessions |
| Monorepo | pnpm workspaces | app, scene, renderer, shared |
The codebase is a pnpm monorepo with TypeScript (app, scene, renderer) and Python (langgraph_server, asset-service, video-effects-service). The LangGraph server and asset service include tests; the agent and tools are typed and documented for maintainability.
- Node.js 20+, pnpm 9 (`corepack enable pnpm`)
- Python 3.11+ (e.g. `uv` or `pip`)
- Redis (for the renderer queue)
- Google Cloud – GCS, optional Pub/Sub, Firebase
- Chrome or Chromium (for the renderer)
```bash
git clone https://github.com/youneslaaroussi/geministudio
cd geministudio
pnpm install
pnpm --filter @gemini-studio/scene run build
pnpm --filter @gemini-studio/renderer run build:headless
```

Copy the example env file for each service you run; set API keys and URLs. Details are in each service’s README.
| Service | Config |
|---|---|
| App | app/env.template → app/.env.local |
| LangGraph | langgraph_server/.env.example → langgraph_server/.env |
| Renderer | REDIS_URL (and optional Pub/Sub) in renderer/ |
| Asset service | asset-service/.env.example → asset-service/.env |
Option A: All services at once (recommended)
Requires Overmind (`brew install overmind`). Starts all 6 services in a unified TUI with per-service panes, logs, and controls.
```bash
pnpm dev
```

| Command | Description |
|---|---|
| `pnpm dev` | Start all services (Overmind TUI) |
| `pnpm dev:connect` | Connect to the Overmind session |
| `pnpm dev:restart` | Restart all or specific services |
| `pnpm dev:stop` | Stop all services |
| `pnpm dev:status` | Show service status |
In the Overmind TUI: press a service's key (e.g. 1 for app, 2 for renderer) to focus its pane; q to quit.
Option B: Without Overmind
If Overmind isn't installed, use the simple concurrent runner:
```bash
pnpm dev:simple
```

Option C: Separate terminals
1. Start Redis.
2. In separate terminals, start each service:

   ```bash
   # Terminal 1 – Renderer
   pnpm --filter @gemini-studio/renderer dev

   # Terminal 2 – LangGraph
   cd langgraph_server && uv run uvicorn langgraph_server.main:app --reload --port 8000

   # Terminal 3 – Asset service
   cd asset-service && uv run python -m asset_service

   # Terminal 4 – App
   pnpm --filter app dev
   ```

   Optionally add billing-service (`cd billing-service && pnpm start:dev`) and video-effects-service (`cd video-effects-service && uv run python -m video_effects_service`).
3. Open http://localhost:3000. If the LangGraph server is elsewhere, set `NEXT_PUBLIC_LANGGRAPH_URL` in the app env.
See deploy/README.md for full instructions. Key step: CI/CD does not copy service account files — you must manually provision them on the VM:
```bash
# Copy service accounts to VM (one-time setup)
gcloud compute scp secrets/google-service-account.json gemini-studio:/tmp/ --zone=us-central1-a
gcloud compute scp secrets/firebase-service-account.json gemini-studio:/tmp/ --zone=us-central1-a
gcloud compute ssh gemini-studio --zone=us-central1-a --command='
  sudo mv /tmp/google-service-account.json /opt/gemini-studio/deploy/secrets/
  sudo mv /tmp/firebase-service-account.json /opt/gemini-studio/deploy/secrets/
  sudo chmod 644 /opt/gemini-studio/deploy/secrets/*.json
'
```

```
GeminiStudio/
├── app/                      # Next.js app (editor, chat, assets UI)
├── ai-sdk/                   # Patched Vercel AI SDK (file-url in tool results for Gemini); see Credits
├── scene/                    # Motion Canvas project (Vite)
├── scene-compiler/           # On-demand scene compilation (esbuild default, optional Vite; Express)
├── renderer/                 # Render service (Express, BullMQ, headless bundle)
├── langgraph_server/         # LangGraph agent (FastAPI, Gemini 3, tools)
├── asset-service/            # Asset upload & pipeline (Gemini analysis, GCS, Firestore)
├── video-effects-service/    # Video effects (FastAPI, Replicate)
├── billing-service/          # Credits & billing (NestJS)
├── shared/                   # Shared tool manifest (shared/tools/manifest.json)
├── deploy/                   # Terraform, Caddy, docker-compose
├── package.json              # Root pnpm workspace
├── pnpm-workspace.yaml
└── README.md                 # This file
```
Key areas:
| Area | Path |
|---|---|
| Agent & tools | langgraph_server/agent.py, langgraph_server/tools/ |
| AI SDK fork | ai-sdk/ (patched @ai-sdk/google for multimodal tool results; app uses file:../ai-sdk/packages/google) |
| Tool manifest | shared/tools/manifest.json |
| Scene compiler | scene-compiler/ (esbuild-based on-demand compilation with custom component injection; optional Vite via SCENE_COMPILER_ENGINE=vite) |
| Renderer | renderer/ |
| Scene | scene/ (Motion Canvas project, component registry, clip playback) |
| App | app/app/ |
Each service has its own README for setup and deployment.
Gemini Studio is built on top of incredible open-source projects and cloud services. We're grateful to the communities that made this possible.
- Motion Canvas — The rendering engine that makes code-first video editing possible. React-like TypeScript API perfect for LLM-generated animations. GitHub · Docs
- FFmpeg — Audio/video transcoding, merging, and encoding in the renderer. We use it via fluent-ffmpeg.
- Puppeteer — Headless Chrome for running Motion Canvas and exporting frame-perfect video in the cloud.
- Mediabunny — JavaScript library for reading, writing, and converting video and audio in the browser. Web-first media toolkit.
- Vercel AI SDK — React hooks and streaming for the chat UI. We use `ai`, `@ai-sdk/react`, and `@ai-sdk/google` for the frontend agent experience. Multimodal tool results: The upstream SDK did not support sending video, image, or audio from tool results to Gemini (tool-returned files were serialized as JSON text, so the model never saw the media). We implemented `file-url` handling in the Google provider so that when a tool returns a file (e.g. our `watchVideo`/`watchAsset` tools), the model receives it as real `fileData` and can see/hear the content. This is a significant contribution: it enables the “agent watches its own work” loop and any agent that returns media from tools. We ship a patched copy under `ai-sdk/` (Apache 2.0, see ai-sdk/LICENSE) and plan to submit this as a PR to vercel/ai.
- LangGraph — Agent orchestration and tool execution. Powers our conversational agent. GitHub
- Google Gemini 3 Pro — Reasoning, tool use, multimodal understanding (video, images, audio), and generative APIs (Veo, Imagen, Lyria, Chirp).
Used for background asset understanding and indexing (the agent’s live reasoning and multimodal understanding are powered by Gemini).
- Cloud Video Intelligence API — Background shot detection, label detection, and video understanding in the asset pipeline so the library is searchable.
- Cloud Speech-to-Text — Background transcription with word-level timestamps so assets are searchable and captions can be generated.
- Cloud Text-to-Speech — Narration and TTS (Chirp) integration.
- Cloud Storage — Asset and render output storage.
- Cloud Pub/Sub — Render completion and pipeline events.
- Firebase — Auth, Firestore (projects, metadata), and real-time sync.
- Algolia — Semantic and full-text search over the asset library.
- CloudConvert — Image and document conversion in the asset pipeline.
- Replicate — Video effects (e.g. background removal, chroma key) via the video-effects-service.
- Next.js — Web app and API routes.
- FastAPI — LangGraph server, asset-service, video-effects-service.
- BullMQ — Render job queue (Redis-backed).
- Automerge — CRDTs for collaborative timeline and branch sync.
Built by Younes Laaroussi
| | Link |
|---|---|
| Site | youneslaaroussi.ca |
| LinkedIn | linkedin.com/in/younes-laaroussi |
| 𝕏 | @younesistaken |
Elastic License 2.0 (ELv2) – See LICENSE.
You may use, copy, distribute, and make derivative works of the software. You may not offer it to third parties as a hosted or managed service (i.e. you cannot run “Gemini Studio as a service” for others). You must keep license and copyright notices intact and pass these terms on to anyone who receives the software from you.





