
Gemini Studio


The Execution Layer for Agentic Video

Generative AI solved pixel generation. We're solving video production.

The deterministic engine that gives AI agents the hands to edit video.

Built for the Gemini 3 Hackathon

Next.js · Gemini 3 Pro · LangGraph · Motion Canvas · TypeScript · Python · License

Architecture


The Bet: In 2 years, manual video editing will be obsolete for 90% of use cases. The bottleneck isn't AI models—it's the lack of infrastructure that lets agents actually edit. We're building that missing layer.

The Moat: Gemini Studio is the first video editor that understands your assets and clips semantically. The system organizes everything for you—no more renaming each file by hand or wasting hours in bins. Search your library in plain language (e.g. the drone shot over the water, the clip where the crowd cheers); the agent uses the same understanding to resolve which asset or clip you mean. Semantic understanding turns natural language into precise edits.


Demo & submission

| Item | Link |
| --- | --- |
| Live demo | https://www.geminivideo.studio/ |
| Repository | https://github.com/youneslaaroussi/geministudio |

Table of Contents

  • The Problem: AI Can Generate Pixels, But It Can't Produce Video
  • The Solution: An Execution Layer for AI Agents
  • Why Not a Plugin? A Ground-Up Redesign
  • Why This Changes Everything
  • Gemini 3 Pro: The Reasoning Layer
  • Components
  • How the Execution Layer Works
  • Motion Canvas: The Perfect Rendering Layer for LLMs
  • Programmatic Motion Graphics: The Agent Writes Components, Not Templates
  • Autonomous Video Production: The Agent Can Watch Its Own Work
  • Tech stack
  • Setup
  • Production Deployment
  • Repository structure
  • Credits & Resources
  • License


The Problem: AI Can Generate Pixels, But It Can't Produce Video

Veo solved generation. But it didn't solve production.

A raw AI-generated clip is not a finished video. It has no narrative structure, no pacing, no intent. The bottleneck isn't the model—it's the lack of a rendering engine that can translate an agent's text-based intent into a frame-perfect video edit.

We're moving from "Tools for Editors" (e.g. Premiere Pro) to "Directors for Agents."


The Solution: An Execution Layer for AI Agents

Gemini Studio is the infrastructure that gives AI agents hands.

We built the deterministic engine that allows an agent to:

  • Ingest raw footage (screen recordings, generated clips, uploads)
  • Organize & search — Assets and clips are indexed by content. No manual renaming; find anything by describing it (the wide shot, that B-roll of the product). The agent uses the same semantic layer for both assets and clips.
  • Understand your intent—that title card, the drone shot, zoom in on the error—no file names, no hunting
  • Execute the edit programmatically—frame-perfect, no human in the loop

This isn't a chatbot wrapper. The agent has real agency: it calls the renderer, manipulates the timeline, triggers Veo 3/Nano Banana Pro/Lyria/Chirp generation, and proactively notifies you when your video is ready. Gemini 3 Pro becomes the reasoning layer for the entire production stack.

The result: Video creation transforms from a manual craft into a scalable API call.


Why Not a Plugin? A Ground-Up Redesign

You can't bolt agentic capabilities onto software built for humans.

A Premiere Pro plugin is fundamentally limited: it automates UI clicks, parses menus, and simulates mouse movements. It's brittle, slow, and constrained by the editor's human-centric architecture. The agent is a guest in someone else's house, following rules it didn't write.

Gemini Studio is built from the ground up with agent-native tools and code.

Agent-Native Architecture

| Aspect | Plugin Approach | Gemini Studio (Ground-Up) |
| --- | --- | --- |
| Timeline Control | UI automation, brittle clicks | Programmatic API: deterministic, version-controlled |
| Asset Resolution | File paths, manual matching | Semantic understanding: resolve by meaning, not filename |
| Rendering | Export dialogs, progress bars | Headless renderer: API-driven, event-based |
| State Management | Screen scraping, guessing | Native state: the agent knows exactly what's on the timeline |
| Iteration | Can't watch its own work | Full loop: agent analyzes output and iterates autonomously |
| Branching | Impossible: single timeline | Git-style branches: agent edits on branches, you merge |

What Only a Ground-Up Design Enables

1. Semantic Asset Resolution Plugins can't change how Premiere indexes assets. We built semantic understanding into the core—every upload is analyzed, every clip is searchable by content. The agent doesn't need file paths; it resolves "the drone shot" by understanding what's in your library.

2. Deterministic Rendering Plugins trigger exports through UI dialogs. Our renderer is headless and API-driven—the agent calls render() with exact parameters, gets events on completion, and can iterate without human intervention.

3. Version-Controlled Timelines Traditional editors have one timeline state. We built branching into the architecture—the agent edits on a branch, you review and merge. This requires ground-up state management that plugins can't provide.

4. Agent-Native Tools Our 30+ tools aren't wrappers around UI actions—they're first-class operations designed for programmatic control. add_clip(), apply_transition(), search_assets(), createComponent() are deterministic functions the agent calls directly, not UI simulations.

5. Autonomous Iteration The agent can watch its own renders, critique them, and adjust—a capability that requires tight integration between rendering, analysis, and timeline manipulation. Plugins can't close this loop.

The code is agent-native. Every component—from asset ingestion to final render—is designed for programmatic control. The agent isn't simulating a human editor; it's using tools built for it.
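To illustrate what "first-class operations" means in practice, here is one plausible shape for a structured tool call. The field names are illustrative assumptions; the authoritative contract is the tool manifest in shared/tools/manifest.json.

// Sketch only: a typed, deterministic tool call as the agent might emit it.
// Field names are illustrative; the real schema lives in the tool manifest.
type ToolName = 'add_clip' | 'apply_transition' | 'search_assets' | 'createComponent';

interface ToolCall<TArgs> {
  tool: ToolName;
  args: TArgs;
}

interface AddClipArgs {
  assetId: string;        // resolved semantically, not by file path
  track: number;
  startFrame: number;     // frame-perfect placement, no UI automation
  durationFrames: number;
}

// The agent emits structured calls; the server applies them to timeline state.
const call: ToolCall<AddClipArgs> = {
  tool: 'add_clip',
  args: { assetId: 'asset_123', track: 0, startFrame: 120, durationFrames: 300 },
};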


Why This Changes Everything

Vibe Editing

We had vibe coding. Now we have vibe editing. Describe the feeling you want. "Make it punchy." "Slow it down for drama." "Add energy to this section." "Give me a typewriter intro with a glitch on every 5th character." The agent understands vibes and translates them into concrete editing decisions—cuts, zooms, pacing, transitions—and even writes custom motion graphics from scratch when nothing in a template library would do.

The Cursor for Video Editing

Just like Cursor revolutionized coding by letting AI agents write alongside you, Gemini Studio lets AI agents edit alongside you. Same project. Same timeline. Human and agent, co-directing in real-time. And just like Cursor, the agent doesn't just autocomplete—it writes entire components, previews them, and iterates.

Git-Style Branching for Video

Your timeline is version-controlled. The cloud agent edits on a branch. You review the changes. Merge what you like, discard what you don't. Split timelines, experiment freely, sync seamlessly.
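Since the tech stack lists Automerge for collaborative timeline and branch sync, here is a minimal sketch of how branch-and-merge could look on a CRDT timeline document. The Timeline shape is hypothetical; only the Automerge calls (init, clone, change, merge) are real API.

import * as Automerge from '@automerge/automerge';

// Hypothetical timeline document shape.
interface Timeline {
  clips: { assetId: string; startFrame: number; durationFrames: number }[];
}

// Main timeline.
let main = Automerge.change(Automerge.init<Timeline>(), doc => {
  doc.clips = [{ assetId: 'drone_shot', startFrame: 0, durationFrames: 300 }];
});

// The agent forks a branch and edits it without touching main.
let agentBranch = Automerge.clone(main);
agentBranch = Automerge.change(agentBranch, 'agent: add title card', doc => {
  doc.clips.push({ assetId: 'title_card', startFrame: 300, durationFrames: 90 });
});

// You review the branch, then merge what you like back into main.
main = Automerge.merge(main, agentBranch);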

| Feature | What It Enables |
| --- | --- |
| Semantic assets & clips | No manual renaming—the system organizes and indexes by content. Search your library in plain language; refer to assets and clips by what they are. The agent resolves "the drone shot," "that B-roll," etc. by meaning, not filename |
| Vibe Editing | Intent-based editing ("make it cinematic," "add energy here") |
| Programmatic motion graphics | Agent writes freeform Motion Canvas components—no template ceiling. Typewriter text, animated charts, branded overlays, anything describable |
| Real-time Sync | Agent edits appear live in your timeline |
| Branching | Non-destructive experimentation |
| Merge/Split | Combine agent work with your own edits |

This isn't automation. This is collaboration between human directors and AI agents. Gemini Studio is the code-to-video layer—where vibe editing meets programmatic rendering.


Gemini 3 Pro: The Reasoning Layer

Gemini 3 Pro isn't just integrated—it's the brain that makes agentic video possible. We leverage its state-of-the-art reasoning and native multimodal understanding to power every layer of the stack.

Agent Brain (LangGraph + Gemini 3 Pro) Every interaction flows through Gemini 3 Pro. It reasons over project state, decides which tools to invoke, and orchestrates the entire editing pipeline. We use dynamic thinking_level to balance reasoning depth with response latency. Without Gemini 3 Pro's reasoning and tool use, there is no execution layer—only a traditional UI waiting for human input.

Multimodal Understanding (1M Token Context Window) The agent doesn't just receive text—it sees and hears. Gemini 3 Pro can comprehend video, images, and audio natively through its 1 million token context window. We use the media_resolution parameter to optimize token usage while maintaining fidelity for scene detection, object recognition, and transcription.

Asset & Clip Intelligence Pipeline Every uploaded asset and every clip goes through Gemini 3 Pro's multimodal analysis:

  • Scene Detection — Automatic boundary identification using native video understanding
  • Object Recognition — Context-aware detection throughout the video
  • Speech Transcription — Full audio-to-text with word-level timestamps
  • Semantic Understanding — High-level analysis ("what's happening here?")

The system organizes your library by content—no renaming files by hand. Search assets and clips in plain language; the agent uses the same indexing. It doesn't just know that you have a video—it knows what's in it, frame by frame. This is the moat: you say put the title card over the drone shot and the agent resolves which image and which clip by meaning, not by filename. No other editor does this.

Asset Pipeline
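To make this concrete, here is a rough sketch of the kind of per-asset record such a pipeline could produce and how a plain-language query might resolve against it. The field names, the endpoint URL, and the searchAssets helper are illustrative assumptions, not the actual Firestore schema or asset-service API.

// Illustrative only: a possible shape for the analysis record stored per asset.
interface AssetAnalysis {
  assetId: string;
  scenes: { startSec: number; endSec: number; summary: string }[];   // scene detection
  objects: { label: string; timestampsSec: number[] }[];             // object recognition
  transcript: { word: string; startSec: number; endSec: number }[];  // word-level timestamps
  semanticSummary: string; // e.g. "drone shot over the water at sunset"
}

// Hypothetical asset-service call: resolve a description to concrete assets,
// ranked by semantic relevance.
async function searchAssets(query: string): Promise<AssetAnalysis[]> {
  const res = await fetch('http://localhost:8081/assets/search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query }),
  });
  return res.json();
}

// "Put the title card over the drone shot" resolves by meaning, not filename.
const [droneShot] = await searchAssets('the drone shot over the water');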

Generative Pipeline (Veo 3, Nano Banana Pro, Lyria, Chirp) The agent doesn't just edit existing footage—it creates. Need b-roll? Veo 3. Need a thumbnail? Nano Banana Pro. Need background music? Lyria. Need narration? Chirp TTS. These aren't add-ons; they're first-class tools the agent invokes autonomously based on narrative intent.

The Stack:

| Layer | Role |
| --- | --- |
| Gemini 3 Pro | Reasoning + tool orchestration + multimodal understanding |
| Files API | Media upload and processing |
| Veo 3 / Nano Banana Pro / Lyria / Chirp | Generative media creation |
| Motion Canvas | Deterministic frame-perfect rendering |

This is the full loop: ingest → perceive → reason → generate → render.


Components

| Component | Tech | Port (default) | README |
| --- | --- | --- | --- |
| app | Next.js | 3000 | app/README.md |
| langgraph_server | FastAPI, LangGraph, Gemini | 8000 | langgraph_server/README.md |
| Telegram agent | Same LangGraph server, webhook | | langgraph_server/README.md |
| asset-service | FastAPI, GCS, Firestore | 8081 | asset-service/README.md |
| renderer | Express, BullMQ, Puppeteer, FFmpeg | 4000 | renderer/README.md |
| scene | Motion Canvas, Vite | (build only) | |
| scene-compiler | esbuild (default), optional Vite | 4001 | See Scene Compiler |
| video-effects-service | FastAPI, Replicate | | video-effects-service/README.md |
| billing-service | NestJS, Firebase | | billing-service/README.md |

How the Execution Layer Works

Request Flow

1. Agent Receives Intent — User speaks naturally (web or Telegram). The Gemini 3 Pro agent parses the request and plans the execution.

2. Tools Execute Autonomously — The agent invokes 30+ tools: timeline manipulation, asset search, Veo generation, image creation, TTS, and custom component creation. Each tool is a deterministic operation the agent controls.

3. Renderer Produces Output — Motion Canvas renders the final video headlessly—pixel-perfect, production-ready. Pub/Sub events notify the agent on completion.

4. Agent Closes the Loop — "Your video is ready." The agent proactively informs the user. No polling. No waiting. Full autonomy.
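A minimal sketch of step 4, assuming a Pub/Sub subscription named 'render-complete' and a JSON payload with jobId, outputUrl, and status fields; the real topic names and message schema may differ.

import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();
const subscription = pubsub.subscription('render-complete');

subscription.on('message', message => {
  // Assumed payload shape; the actual renderer event schema may differ.
  const { jobId, outputUrl, status } = JSON.parse(message.data.toString());
  if (status === 'done') {
    // The agent closes the loop: notify the user instead of polling.
    notifyUser(`Your video is ready (job ${jobId}): ${outputUrl}`);
  }
  message.ack();
});

// Hypothetical helper; in practice this goes out through the chat UI or Telegram.
function notifyUser(text: string) {
  console.log(text);
}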


Motion Canvas: The Perfect Rendering Layer for LLMs

Motion Canvas is the secret sauce that makes agentic video editing possible.

We chose Motion Canvas as our rendering engine because it's built for code-first animation—exactly what LLMs excel at. Unlike traditional video editors that require UI automation, Motion Canvas uses React-like TypeScript code that agents can generate naturally.

Why Motion Canvas is Perfect for AI Agents

Code-First Architecture Motion Canvas animations are written as TypeScript generator functions. The agent doesn't simulate clicks or drags—it writes code:

import { makeScene2D, Circle } from '@motion-canvas/2d';
import { createRef } from '@motion-canvas/core';

export default makeScene2D(function* (view) {
  const circle = createRef<Circle>();
  view.add(<Circle ref={circle} width={320} height={320} fill={'blue'} />);
  // Tween the circle: scale up over 0.3 s, then fade its fill to green.
  yield* circle().scale(2, 0.3);
  yield* circle().fill('green', 0.3);
});

This is exactly what LLMs are trained to do: generate code. The agent can compose complex animations, transitions, and effects by writing TypeScript—a task it's already excellent at.

Multimodal Capabilities Motion Canvas integrates seamlessly with Gemini's multimodal understanding. The agent can:

  • Analyze video frames to understand composition
  • Generate code that matches visual intent
  • Iterate by watching renders and adjusting code
  • Compose complex scenes with multiple layers, effects, and transitions

Deterministic & Headless Motion Canvas renders deterministically—same code, same output, every time. Combined with Puppeteer, we can render headlessly in the cloud. The agent calls render(), gets pixel-perfect output, and can iterate without human intervention.

Production-Ready Motion Canvas isn't a prototype—it's battle-tested for production-quality animations. The agent generates code that produces broadcast-ready video, not experimental output.

The Result: Motion Canvas turns video editing from a visual craft into a coding problem—and coding is what LLMs do best. The agent writes TypeScript, Motion Canvas renders it, and you get professional video. This is why agentic video editing works: we're using the right tool for the job.


Programmatic Motion Graphics: The Agent Writes Components, Not Templates

We had vibe coding. Now we have vibe editing. Gemini Studio is the code-to-video layer.

Most "AI video tools" give you a fixed library of templates and effects. Gemini Studio does something fundamentally different: the agent writes real Motion Canvas components from scratch—freeform TypeScript code with signals, generators, tweens, and the full animation runtime. There is no template ceiling. If you can describe it, the agent can build it.

How It Works

When you say "make me a typewriter animation" or "add a progress ring that fills to 75%", the agent doesn't select from a menu. It writes a complete Motion Canvas component:

import { initial, signal, Node, NodeProps, Txt } from '@motion-canvas/2d';
import { createSignal, SignalValue, SimpleSignal } from '@motion-canvas/core';

export interface TypewriterTextProps extends NodeProps {
  fullText?: SignalValue<string>;
  charDelay?: SignalValue<number>;
}

export class TypewriterText extends Node {
  @initial('Hello World') @signal()
  public declare readonly fullText: SimpleSignal<string, this>;

  @initial(0.05) @signal()
  public declare readonly charDelay: SimpleSignal<number, this>;

  // Animation progress from 0 (nothing typed) to 1 (full text revealed).
  private readonly progress = createSignal(0);

  public constructor(props?: TypewriterTextProps) {
    super({ ...props });
    this.add(
      <Txt
        // Reveal a prefix of fullText proportional to the current progress.
        text={() => this.fullText().slice(0, Math.floor(
          this.progress() * this.fullText().length
        ))}
        fill={'#ffffff'}
        fontSize={48}
        fontFamily={'JetBrains Mono'}
      />
    );
  }

  public *reveal(duration?: number) {
    // Default duration: one charDelay per character.
    yield* this.progress(1, duration ?? this.fullText().length * this.charDelay());
  }
}

This component is compiled on the fly, hot-loaded into the live preview, and rendered to final video—all without the user touching code. The agent can also iterate: watch the result, adjust timing, change easing, add effects, and recompile.

The Pipeline

Agent writes TSX → Scene Compiler builds it → Preview renders live → Renderer exports final video
| Stage | What Happens |
| --- | --- |
| Create | Agent calls createComponent with full TSX code, input definitions, and a class name |
| Compile | Scene Compiler service compiles the component into the scene bundle (esbuild by default; see below). A barrel file is auto-generated—no manual registration needed |
| Preview | ScenePlayer detects the new component asset, recompiles, and renders it live in the browser at 30 fps. Changes appear in real time |
| Control | Input definitions (inputDefs) surface as controls in the timeline inspector. Users tweak values (text, color, speed, size) without code |
| Render | The renderer compiles the same component files and exports production-quality video via headless Puppeteer + FFmpeg |
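For illustration, a createComponent call might carry a payload along these lines; the exact argument names are assumptions, and the authoritative contract is the tool definition used by the agent.

import { readFileSync } from 'node:fs';

// The TSX source from the TypewriterText example above, loaded for brevity.
const typewriterTsxSource = readFileSync('TypewriterText.tsx', 'utf8');

// Illustrative payload only; real field names may differ.
const createComponentCall = {
  tool: 'createComponent',
  args: {
    className: 'TypewriterText',
    code: typewriterTsxSource,   // compiled by the scene-compiler service
    inputDefs: [
      { name: 'fullText', type: 'string', default: 'Hello World' },
      { name: 'charDelay', type: 'number', default: 0.05 },
    ],
  },
};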

Scene Compiler: esbuild for Millisecond-Fast Compilation

The scene-compiler service turns Motion Canvas TypeScript (project + scenes + custom components) into a single JavaScript bundle that the preview and renderer load. It used to use Vite with the Motion Canvas plugin; we now use esbuild by default for roughly 25× faster cold compiles and instant cache hits.

| Scenario | Before (Vite) | After (esbuild) |
| --- | --- | --- |
| Cold compile | ~3.5 s | ~130 ms |
| Same inputs (cache hit) | ~3.5 s | 0 ms |
| Custom component change | ~3.5 s | ~70 ms |

How it works: The compiler replicates the Motion Canvas Vite plugin behavior (e.g. ?scene wrappers, .meta and .glsl handling, virtual:settings.meta, custom component injection) in a custom esbuild plugin. Because esbuild strips URL query suffixes like ?scene before plugin callbacks see them, we read and rewrite project.ts at build time so scene imports use a custom __mc_scene__ suffix that the plugin can resolve. The result is the same bundle the Vite pipeline produced, with a small in-memory LRU cache (50 entries, 5 min TTL) so repeated compiles with identical inputs return immediately.
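A simplified sketch of the suffix-resolution idea, assuming the __mc_scene__ suffix described above; the real plugin also wraps scenes the way the Motion Canvas Vite plugin does and handles .meta, .glsl, and virtual:settings.meta, all of which is elided here.

import { build, type Plugin } from 'esbuild';
import { readFile } from 'node:fs/promises';
import { resolve } from 'node:path';

const sceneSuffix: Plugin = {
  name: 'mc-scene-suffix',
  setup(build) {
    // project.ts is rewritten at build time so scene imports end in
    // '__mc_scene__', because esbuild strips '?scene' query suffixes
    // before plugins ever see them.
    build.onResolve({ filter: /__mc_scene__$/ }, args => ({
      // Mapping the suffix back to a .tsx file is an assumption of this sketch.
      path: resolve(args.resolveDir, args.path.replace(/__mc_scene__$/, '.tsx')),
      namespace: 'mc-scene',
    }));

    build.onLoad({ filter: /.*/, namespace: 'mc-scene' }, async args => {
      const source = await readFile(args.path, 'utf8');
      // The real plugin wraps the scene module here; this sketch returns
      // the raw source unchanged.
      return { contents: source, loader: 'tsx', resolveDir: process.cwd() };
    });
  },
};

await build({
  entryPoints: ['scene/src/project.ts'],
  bundle: true,
  format: 'esm',
  outfile: 'dist/project.js',
  plugins: [sceneSuffix],
});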

Rollback: To use the original Vite-based compiler, set SCENE_COMPILER_ENGINE=vite in the scene-compiler environment. No code changes required.

  • Port: 4001
  • Config: scene-compiler/ (e.g. BASE_SCENE_DIR, SCENE_COMPILER_SHARED_SECRET)

Inputs vs. Animation

Component inputs (inputDefs) are static values set on the timeline—text content, colors, sizes, speeds. They don't animate. For temporal behavior (typewriter reveals, counters, progress bars, pulsing effects), the agent uses generator methods and signals inside the component. Inputs control what and how fast; generators and signals create the motion.

This is a deliberate design: it gives users simple controls while the agent handles the complexity of animation code.

What This Enables

| Capability | Example |
| --- | --- |
| Freeform motion graphics | Progress rings, stat counters, animated charts, lower thirds—anything describable |
| Branded overlays | Custom components with your colors, fonts, and layout |
| Data-driven visuals | Components that take numbers/text as inputs and animate them |
| Iterative refinement | Agent previews its own component, adjusts timing/easing, recompiles |
| Zero-template ceiling | No fixed library. The agent writes new components for every request |

Component plugins: data viz, maps, color, and noise

The scene compiler bundles first-class plugins so the agent can generate components that go far beyond static shapes and text. Every plugin is compute-then-render: the library produces data (path strings, positions, colors, numbers); Motion Canvas renders it. No DOM, no charting framework—just the right primitives for AI-generated video.

| Plugin | What it does | What the agent can build |
| --- | --- | --- |
| d3-geo | Geographic projections and SVG path strings from GeoJSON | Animated maps, region highlights, country/continent outlines |
| d3-shape | Arc, pie, line, and area path generators | Pie charts, donuts, line charts, area charts, animated data series |
| d3-scale | Map data domains to pixel ranges and color scales | Bar positions, axes, data-driven colors and sizes |
| d3-hierarchy | Tree, treemap, and pack layout algorithms | Tree diagrams, treemaps, sunbursts, nested visualizations |
| simplex-noise | Procedural 2D/3D/4D noise | Organic motion, flowing backgrounds, terrain, generative-style effects |
| chroma-js | Color scales, interpolation (LAB/LCH), lighten/darken/saturate | Palettes, data-driven colors, accessible contrast, brand-matched fills |

The in-app Monaco editor is wired with TypeScript types for all of these: when the agent (or you) edits a component, IntelliSense and type checking work for d3-shape, d3-scale, chroma, and the rest. Result: the agent can say "animate a bar chart from this data", "draw a map of Europe and highlight Germany", or "add a smooth noise-based background" and produce real, compilable code with full editor support.
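As a sense of what compute-then-render looks like, here is a rough sketch of a donut chart scene: d3-shape computes the arc path strings and Motion Canvas draws them. It assumes the Path node's data prop accepts an SVG path string; check the Motion Canvas version pinned in scene/ before relying on it.

import { makeScene2D, Path } from '@motion-canvas/2d';
import { all, createRef } from '@motion-canvas/core';
import { arc, pie, type PieArcDatum } from 'd3-shape';

export default makeScene2D(function* (view) {
  const values = [45, 30, 25];
  const colors = ['#4285F4', '#34A853', '#FBBC05'];

  // d3 computes geometry (SVG path strings); Motion Canvas renders it.
  const arcGen = arc<PieArcDatum<number>>().innerRadius(90).outerRadius(150);
  const slices = pie<number>()(values);

  const refs = slices.map(() => createRef<Path>());
  slices.forEach((slice, i) =>
    view.add(
      <Path ref={refs[i]} data={arcGen(slice) ?? ''} fill={colors[i]} opacity={0} />,
    ),
  );

  // Fade all donut slices in together.
  yield* all(...refs.map(ref => ref().opacity(1, 0.5)));
});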

This is what separates Gemini Studio from every other AI video tool. Others wrap a fixed API. We have a real runtime underneath that the AI programs against. The agent doesn't pick from a menu—it writes code, compiles it, previews it, and ships it.


Autonomous Video Production: The Agent Can Watch Its Own Work

This is the moat. Gemini Studio is the first platform where an AI agent can autonomously iterate on video edits without human intervention.

The Iteration Loop

Agent Iteration Loop

Why This Matters

Traditional AI video tools: Generate → Done. No feedback loop. No iteration.

Gemini Studio: Generate → Watch → Critique → Adjust → Repeat → Deliver.

The agent has:

  • Eyes (Gemini multimodal can analyze video content)
  • Hands (30+ tools for timeline manipulation, component creation, and media generation)
  • Creativity (writes custom motion graphics from scratch—no template ceiling)
  • Judgment (can evaluate pacing, cuts, transitions, and component design)
  • Memory (maintains context across iterations)

This is the difference between a tool that produces output and an agent that produces quality output.

Live Voice Chat: Real-Time Collaboration

Gemini Studio includes Live API integration for real-time voice conversations with the AI agent. The agent can execute tools, manipulate your timeline, and even "see" previews of your work—all through natural voice commands.

Current: When you ask the agent to watch your video, it extracts key frames using Mediabunny and analyzes them to understand the composition.

Future Vision: The Live API will continuously stream what you see—the same preview panel, at 5x playback speed—directly to the agent. This creates a true "pair editing" experience where the AI watches alongside you in real-time, ready to act on voice commands like "that cut was too early" or "add a transition here". The agent becomes a co-director who sees exactly what you see.

Render Quality Controls

The agent intelligently chooses render settings based on intent:

| Mode | Settings | Use Case |
| --- | --- | --- |
| Preview | quality='low', fps=15, range=[start, end] | Fast iteration, reviewing segments |
| Draft | quality='web', full timeline | Near-final review |
| Production | quality='studio', full timeline | Final delivery |
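Expressed as tool arguments (names are assumptions mirroring the table above), the difference between a fast preview and a final render might look like this:

// Illustrative render calls; argument names follow the table above.
const previewJob = {
  tool: 'render',
  args: { quality: 'low', fps: 15, range: [120, 420] as [number, number] }, // fast partial render
};

const productionJob = {
  tool: 'render',
  args: { quality: 'studio' }, // full timeline, final delivery
};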

Tech stack

Infrastructure

| Category | Technology | Purpose |
| --- | --- | --- |
| Frontend | Next.js, React | Web app, timeline editor, chat UI |
| Agent | LangGraph, Gemini | Conversational agent, tools |
| Render | Motion Canvas, Puppeteer | Headless video composition |
| Queue | BullMQ, Redis | Render job queue |
| Backend | FastAPI (Python) | LangGraph server, asset service, video-effects-service |
| Storage | GCS, Firestore | Assets, metadata, projects |
| Events | Google Cloud Pub/Sub | Render completion, pipeline events |
| Auth | Firebase | Auth, projects, chat sessions |
| Monorepo | pnpm workspaces | app, scene, renderer, shared |

The codebase is a pnpm monorepo with TypeScript (app, scene, renderer) and Python (langgraph_server, asset-service, video-effects-service). The LangGraph server and asset service include tests; the agent and tools are typed and documented for maintainability.


Setup

Prerequisites

  • Node.js 20+, pnpm 9 (corepack enable pnpm)
  • Python 3.11+ (e.g. uv or pip)
  • Redis (for the renderer queue)
  • Google Cloud – GCS, optional Pub/Sub, Firebase
  • Chrome or Chromium (for the renderer)

Install

git clone https://github.com/youneslaaroussi/geministudio
cd geministudio
pnpm install
pnpm --filter @gemini-studio/scene run build
pnpm --filter @gemini-studio/renderer run build:headless

Environment

Copy the example env file for each service you run; set API keys and URLs. Details are in each service’s README.

| Service | Config |
| --- | --- |
| App | app/env.template → app/.env.local |
| LangGraph | langgraph_server/.env.example → langgraph_server/.env |
| Renderer | REDIS_URL (and optional Pub/Sub) in renderer/ |
| Asset service | asset-service/.env.example → asset-service/.env |

Run locally

Option A: All services at once (recommended)

Requires Overmind (brew install overmind). Starts all 6 services in a unified TUI with per-service panes, logs, and controls.

pnpm dev
| Command | Description |
| --- | --- |
| pnpm dev | Start all services (Overmind TUI) |
| pnpm dev:connect | Connect to the Overmind session |
| pnpm dev:restart | Restart all or specific services |
| pnpm dev:stop | Stop all services |
| pnpm dev:status | Show service status |

In the Overmind TUI: press a service's key (e.g. 1 for app, 2 for renderer) to focus its pane; q to quit.

Option B: Without Overmind

If Overmind isn't installed, use the simple concurrent runner:

pnpm dev:simple

Option C: Separate terminals

  1. Start Redis.

  2. In separate terminals, start each service:

    Terminal 1 – Renderer

    pnpm --filter @gemini-studio/renderer dev

    Terminal 2 – LangGraph

    cd langgraph_server && uv run uvicorn langgraph_server.main:app --reload --port 8000

    Terminal 3 – Asset service

    cd asset-service && uv run python -m asset_service

    Terminal 4 – App

    pnpm --filter app dev

    Optionally add billing-service (cd billing-service && pnpm start:dev) and video-effects-service (cd video-effects-service && uv run python -m video_effects_service).

  3. Open http://localhost:3000. If the LangGraph server is elsewhere, set NEXT_PUBLIC_LANGGRAPH_URL in the app env.

Production Deployment

See deploy/README.md for full instructions. Key step: CI/CD does not copy service account files — you must manually provision them on the VM:

# Copy service accounts to VM (one-time setup)
gcloud compute scp secrets/google-service-account.json gemini-studio:/tmp/ --zone=us-central1-a
gcloud compute scp secrets/firebase-service-account.json gemini-studio:/tmp/ --zone=us-central1-a
gcloud compute ssh gemini-studio --zone=us-central1-a --command='
  sudo mv /tmp/google-service-account.json /opt/gemini-studio/deploy/secrets/
  sudo mv /tmp/firebase-service-account.json /opt/gemini-studio/deploy/secrets/
  sudo chmod 644 /opt/gemini-studio/deploy/secrets/*.json
'

Repository structure

GeminiStudio/
├── app/                    # Next.js app (editor, chat, assets UI)
├── ai-sdk/                 # Patched Vercel AI SDK (file-url in tool results for Gemini); see Credits
├── scene/                  # Motion Canvas project (Vite)
├── scene-compiler/         # On-demand scene compilation (esbuild default, optional Vite; Express)
├── renderer/               # Render service (Express, BullMQ, headless bundle)
├── langgraph_server/       # LangGraph agent (FastAPI, Gemini 3, tools)
├── asset-service/          # Asset upload & pipeline (Gemini analysis, GCS, Firestore)
├── video-effects-service/  # Video effects (FastAPI, Replicate)
├── billing-service/        # Credits & billing (NestJS)
├── shared/                 # Shared tool manifest (shared/tools/manifest.json)
├── deploy/                 # Terraform, Caddy, docker-compose
├── package.json            # Root pnpm workspace
├── pnpm-workspace.yaml
└── README.md               # This file

Key areas:

| Area | Path |
| --- | --- |
| Agent & tools | langgraph_server/agent.py, langgraph_server/tools/ |
| AI SDK fork | ai-sdk/ (patched @ai-sdk/google for multimodal tool results; app uses file:../ai-sdk/packages/google) |
| Tool manifest | shared/tools/manifest.json |
| Scene compiler | scene-compiler/ (esbuild-based on-demand compilation with custom component injection; optional Vite via SCENE_COMPILER_ENGINE=vite) |
| Renderer | renderer/ |
| Scene | scene/ (Motion Canvas project, component registry, clip playback) |
| App | app/app/ |

Each service has its own README for setup and deployment.


Credits & Resources

Gemini Studio is built on top of incredible open-source projects and cloud services. We're grateful to the communities that made this possible.

Rendering & Media

  • Motion Canvas — The rendering engine that makes code-first video editing possible. React-like TypeScript API perfect for LLM-generated animations. GitHub · Docs
  • FFmpeg — Audio/video transcoding, merging, and encoding in the renderer. We use it via fluent-ffmpeg.
  • Puppeteer — Headless Chrome for running Motion Canvas and exporting frame-perfect video in the cloud.
  • Mediabunny — JavaScript library for reading, writing, and converting video and audio in the browser. Web-first media toolkit.

AI & Agent

  • Vercel AI SDK — React hooks and streaming for the chat UI. We use ai, @ai-sdk/react, and @ai-sdk/google for the frontend agent experience. Multimodal tool results: The upstream SDK did not support sending video, image, or audio from tool results to Gemini (tool-returned files were serialized as JSON text, so the model never saw the media). We implemented file-url handling in the Google provider so that when a tool returns a file (e.g. our watchVideo / watchAsset tools), the model receives it as real fileData and can see/hear the content. This is a significant contribution: it enables the “agent watches its own work” loop and any agent that returns media from tools. We ship a patched copy under ai-sdk/ (Apache 2.0, see ai-sdk/LICENSE) and plan to submit this as a PR to vercel/ai.
  • LangGraph — Agent orchestration and tool execution. Powers our conversational agent. GitHub
  • Google Gemini 3 Pro — Reasoning, tool use, multimodal understanding (video, images, audio), and generative APIs (Veo, Imagen, Lyria, Chirp).

Google Cloud & Firebase

Used for background asset understanding and indexing (the agent’s live reasoning and multimodal understanding are powered by Gemini).

  • Cloud Video Intelligence API — Background shot detection, label detection, and video understanding in the asset pipeline so the library is searchable.
  • Cloud Speech-to-Text — Background transcription with word-level timestamps so assets are searchable and captions can be generated.
  • Cloud Text-to-Speech — Narration and TTS (Chirp) integration.
  • Cloud Storage — Asset and render output storage.
  • Cloud Pub/Sub — Render completion and pipeline events.
  • Firebase — Auth, Firestore (projects, metadata), and real-time sync.

Search, Conversion & Effects

  • Algolia — Semantic and full-text search over the asset library.
  • CloudConvert — Image and document conversion in the asset pipeline.
  • Replicate — Video effects (e.g. background removal, chroma key) via the video-effects-service.

Frameworks & Infrastructure

  • Next.js — Web app and API routes.
  • FastAPI — LangGraph server, asset-service, video-effects-service.
  • BullMQ — Render job queue (Redis-backed).
  • Automerge — CRDTs for collaborative timeline and branch sync.

Links

  • Site: youneslaaroussi.ca
  • LinkedIn: linkedin.com/in/younes-laaroussi
  • 𝕏: @younesistaken

License

Elastic License 2.0 (ELv2) – See LICENSE.

You may use, copy, distribute, and make derivative works of the software. You may not offer it to third parties as a hosted or managed service (i.e. you cannot run “Gemini Studio as a service” for others). You must keep license and copyright notices intact and pass these terms on to anyone who receives the software from you.
