The build

How Conjure works.

Multi-agent live video room, end to end on Runway — and a debrief pipeline that keeps talking after you leave the room.

How it works

Built on Runway, end to end.

Most products that touch Runway use one endpoint. Conjure composes the catalog — Custom Avatars, Realtime Sessions, Avatar Videos, and three text-to-video models — into a single room that starts with a description and ends with a film.

Describe

Who's in the room?

Name the people. Drop a photo or paste their LinkedIn — Conjure extracts their behavior and the question they keep asking under pressure.

Claude · web_search

Summon

Three faces, sixty seconds.

A photo becomes a Runway Custom Avatar. A behavioral note becomes a personality prompt. Three avatars are forged in parallel and seated in the room.

Runway Custom Avatars
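
As a rough illustration of "forged in parallel", the sketch below starts all avatar creations at once and seats them in input order. Everything named here is hypothetical: `PersonaInput`, `SeatedAvatar`, `summonRoom`, and the injected `createAvatar` function are stand-ins, not Runway's actual Custom Avatars API.

```typescript
// Hypothetical shapes for illustration; none of these names come from Runway's SDK.
interface PersonaInput {
  name: string;
  photoUrl: string;
  behaviorNote: string;
}

interface SeatedAvatar {
  name: string;
  avatarId: string;
  personalityPrompt: string;
}

// Stand-in for whatever call actually creates a Custom Avatar from a photo.
type CreateAvatar = (photoUrl: string) => Promise<string>;

// Assumed prompt format: the behavioral note becomes the personality prompt.
function buildPersonalityPrompt(p: PersonaInput): string {
  return `You are ${p.name}. ${p.behaviorNote}`;
}

async function summonRoom(
  personas: PersonaInput[],
  createAvatar: CreateAvatar,
): Promise<SeatedAvatar[]> {
  // Each mapped callback calls createAvatar immediately, so all requests are
  // in flight together; Promise.all keeps the results in input order.
  return Promise.all(
    personas.map(async (p) => ({
      name: p.name,
      avatarId: await createAvatar(p.photoUrl),
      personalityPrompt: buildPersonalityPrompt(p),
    })),
  );
}
```

Because the creation call is injected, the same function can be exercised with a fake in tests and with the real client in production.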

Walk into the room

Three avatars. One coordinator.

Three live Realtime Sessions in one browser. A server-side coordinator routes your microphone to whoever should hear you, and relays cross-character context via tool calls.

Realtime Sessions · LiveKit · Deepgram
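
The routing rule described above (one avatar hears the microphone at a time) can be sketched as a small gate switcher. This is a minimal sketch under assumptions: `MicGate` and `MicRouter` are hypothetical names, and the real WebRTC or LiveKit track objects are deliberately not modeled.

```typescript
// Hypothetical stand-in for "this session's copy of the user's mic track".
interface MicGate {
  setEnabled(enabled: boolean): void;
}

class MicRouter {
  private gates: Record<string, MicGate>;
  private active: string | null = null;

  constructor(gates: Record<string, MicGate>) {
    this.gates = gates;
  }

  // Open the target's gate and close every other one, so exactly one
  // avatar can hear the user after each call.
  routeTo(target: string): void {
    if (!(target in this.gates)) {
      throw new Error("unknown avatar: " + target);
    }
    for (const id of Object.keys(this.gates)) {
      this.gates[id].setEnabled(id === target);
    }
    this.active = target;
  }

  // The avatar currently attached to the microphone, or null before the first route.
  get listener(): string | null {
    return this.active;
  }
}
```

In a browser, each `MicGate` would wrap whatever enables or disables the outgoing audio for one session; how the coordinator picks the target is outside this sketch.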

Reflect

The room keeps talking.

After the room: a narrated debrief, a single cinematic shot of the most charged moment, and a postmortem in which the personas talk about you while you’re not in the room.

Avatar Videos · Seedance 2 · Sonnet

The hard part

The coordinator.

Three avatars in a 2×2 grid is the easy part. Making them feel like a room — taking turns, knowing what the others said, reacting to each other — is the work. Conjure’s coordinator is the difference between three chatbots in a video call and a meeting you walked into.

Mic routing for turn-taking

Your microphone is attached to one avatar at a time via WebRTC track manipulation. Only that avatar can hear you. That single input gate produces natural turn-taking instead of a chorus.

Cross-character awareness via tool calls

Each avatar’s personality includes an instruction to call check_room_state before responding. The server returns what the other avatars have said since this one last spoke, plus a behavioral nudge, so each character reacts to the room, not just to you.

Live video, not text

Three concurrent WebRTC sessions render in your browser at once. No Zoom bot, no screen recording: the room is the page.
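
One way the server side of check_room_state could work is a shared log with a per-avatar cursor. The sketch below is an assumption about the mechanism, not the actual coordinator: `RoomLog`, `Utterance`, `RoomStateReply`, and the nudge wording are all hypothetical.

```typescript
interface Utterance {
  speaker: string;
  text: string;
}

interface RoomStateReply {
  heardSinceLastTurn: Utterance[];
  nudge: string;
}

class RoomLog {
  private utterances: Utterance[] = [];
  // For each avatar, the log length at the moment it last spoke.
  private cursor: Record<string, number> = {};

  // Record what an avatar said; speaking moves its cursor to the end of the log.
  record(speaker: string, text: string): void {
    this.utterances.push({ speaker, text });
    this.cursor[speaker] = this.utterances.length;
  }

  // Return everything said after the caller's last turn. The caller cannot
  // appear in the slice, because its own speech always advances its cursor.
  checkRoomState(caller: string): RoomStateReply {
    const from = this.cursor[caller] ?? 0;
    const heard = this.utterances.slice(from);
    const nudge =
      heard.length === 0
        ? "No one else has spoken since your last turn; answer the user directly."
        : `React to what ${heard[heard.length - 1].speaker} just said before making your own point.`;
    return { heardSinceLastTurn: heard, nudge };
  }
}
```

The reply object is what the tool call would hand back to the avatar; a production version would likely summarize long transcripts rather than return them verbatim.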

After the room

Three artifacts.

One pipeline, three different jobs. Each uses a different combination of Runway primitives and supporting models, and that’s deliberate.

Debrief

A narrated summary.

The lead persona reads back the room in their own voice — what landed, what to sharpen, what to bring next time.

Avatar Videos · Deepgram

Cinema

The most charged moment.

One Claude pass distills the conversation into a cinematographer's shot description. Runway Seedance renders it — eight seconds, ambient sound.

Seedance 2 · Sonnet

Postmortem

What they say after you leave.

A 60–90 second debrief between the three personas, recorded as you watch. They disagree about you on at least one specific moment. By design.

Avatar Videos · Sonnet · parallel renders
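
The stack lists parallel renders alongside ffmpeg concat, so one plausible shape for the postmortem is: render each persona's clips concurrently, then stitch them in conversation order. The sketch below only builds the inputs for ffmpeg's concat demuxer. `RenderedClip`, `buildConcatList`, and `buildConcatArgs` are hypothetical names, and whether Conjure stitches clips exactly this way is an assumption.

```typescript
interface RenderedClip {
  turn: number; // position of this clip in the conversation
  path: string; // where the rendered file was written
}

// Build the text of a concat-demuxer list file: one "file '<path>'" line per
// clip, sorted by turn because parallel renders can finish in any order.
function buildConcatList(clips: RenderedClip[]): string {
  return (
    clips
      .slice()
      .sort((a, b) => a.turn - b.turn)
      // A single quote inside a path is written as '\'' in the list file.
      .map((c) => `file '${c.path.replace(/'/g, "'\\''")}'`)
      .join("\n") + "\n"
  );
}

// Arguments for a stream-copy concat. "-c copy" avoids re-encoding but only
// works when every clip shares the same codec parameters.
function buildConcatArgs(listPath: string, outPath: string): string[] {
  return ["-f", "concat", "-safe", "0", "-i", listPath, "-c", "copy", outPath];
}
```

The argument array would be passed to the ffmpeg binary (for example, the path exported by ffmpeg-static) through a child process; that step is left out here.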

The stack

Runway

Custom Avatars
Realtime Sessions
Avatar Videos
Seedance 2 · Veo 3.1 · Gen-4.5

Anthropic

Claude Sonnet 4.6
Claude Haiku 4.5
web_search tool
JSON-mode prefill

Voice & video

Deepgram Aura 2 TTS
LiveKit WebRTC
Sequential clip playback
ffmpeg-static (concat)

Application

Next.js 16 (App Router)
Redis-backed state
Server-Sent Events
Vercel + Railway

The constraints

The room had limits.

Naming what was hard is more credible than pretending nothing was. Each of these is a real Runway / Anthropic API constraint we built around — not a bug list.

5-minute session cap

Runway Realtime Sessions hard-cap at 5 minutes. Conjure frames it as a feature: focused conversation, not open-ended chat. Every demo is built to land under 90 seconds of conversation.
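
A client can track the cap with a small clock helper. In this sketch the 5-minute figure comes from the text above, while the 30-second wrap-up window and the names `sessionPhase` and `SessionPhase` are assumptions for illustration.

```typescript
const SESSION_CAP_MS = 5 * 60 * 1000; // hard cap stated above
const WRAP_UP_WINDOW_MS = 30 * 1000; // hypothetical warning window

type SessionPhase = "live" | "wrap-up" | "ended";

// Classify a session by how much of the cap is left at time nowMs.
function sessionPhase(
  startedAtMs: number,
  nowMs: number,
): { phase: SessionPhase; remainingMs: number } {
  const remainingMs = Math.max(0, SESSION_CAP_MS - (nowMs - startedAtMs));
  const phase: SessionPhase =
    remainingMs === 0 ? "ended" : remainingMs <= WRAP_UP_WINDOW_MS ? "wrap-up" : "live";
  return { phase, remainingMs };
}
```

The "wrap-up" phase is where a UI could prompt the personas to close out before the session is cut off.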

3 concurrent personas

Tier 2 caps you at three concurrent video sessions. Three is the right number for a room — it's enough to feel populated, not so many that the user gets lost. The constraint shaped the product.

Voice cloning is incompatible with the live room

Runway disables both webcam input and tool calling when a custom voice is used on a session. Conjure depends on both. Voice cloning is therefore deferred to v0.2, until either the API permits the combination or we architect around it.

Per-persona voices via Deepgram, not Runway

To give each persona a distinct in-character voice, the debrief and postmortem pipelines synthesize speech via Deepgram Aura 2 and pass the audio to Runway's avatar-video endpoint. Native per-persona Runway voices land when the constraint above is resolved.
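
The two-step flow described above (synthesize speech, then drive the avatar video with that audio) can be sketched with injected functions. `SynthesizeSpeech`, `RenderAvatarVideo`, `PersonaVoice`, and `renderSpokenLine` are hypothetical; the real Deepgram and Runway request shapes are not described in this document and are not modeled here.

```typescript
// Hypothetical function types standing in for the real Deepgram and Runway calls.
type SynthesizeSpeech = (text: string, voice: string) => Promise<Uint8Array>;
type RenderAvatarVideo = (avatarId: string, audio: Uint8Array) => Promise<string>;

interface PersonaVoice {
  avatarId: string;
  voice: string; // per-persona voice identifier, assumed to be a plain string
}

async function renderSpokenLine(
  persona: PersonaVoice,
  line: string,
  synthesize: SynthesizeSpeech,
  renderVideo: RenderAvatarVideo,
): Promise<string> {
  // Step 1: the persona's own voice turns the script line into audio.
  const audio = await synthesize(line, persona.voice);
  // Step 2: the audio, not the text, is what the avatar video is rendered from.
  return renderVideo(persona.avatarId, audio);
}
```

Keeping the voice choice on the persona record is what lets three characters sound different while sharing one rendering path.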

Cross-character awareness costs ~1.5s per turn

Avatars cannot hear each other natively; they hear only the user. Cross-character context flows through a server-side coordinator that returns a summary via the check_room_state tool call. The realistic per-turn budget is 1.0–1.5 seconds, and the delay is masked by deliberate visual 'thinking' beats on the listening tile.

No persistent personas in v0.1

The Circle stores personas in localStorage for the current browser. Tier 3 — persistent, refining personas across rooms — lands in v1.0; v0.1 keeps the scope tight.
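
A browser-scoped persona library can be as small as the sketch below. The storage key, the `StoredPersona` shape, and the function names are assumptions; the `KeyValueStore` interface is just the subset of the Web Storage API that the code needs, so `window.localStorage` can be passed in directly.

```typescript
interface StoredPersona {
  id: string;
  name: string;
  behaviorNote: string;
}

// The two Web Storage methods used here; window.localStorage satisfies this.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const PERSONAS_KEY = "conjure.personas"; // hypothetical key name

// Read the saved list, treating missing or corrupt data as an empty library.
function loadPersonas(store: KeyValueStore): StoredPersona[] {
  const raw = store.getItem(PERSONAS_KEY);
  if (raw === null) return [];
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return [];
  }
}

// Insert or replace a persona by id and write the whole list back.
function savePersona(store: KeyValueStore, persona: StoredPersona): StoredPersona[] {
  const next = loadPersonas(store)
    .filter((p) => p.id !== persona.id)
    .concat(persona);
  store.setItem(PERSONAS_KEY, JSON.stringify(next));
  return next;
}
```

Because the data lives in one browser's localStorage, clearing site data or switching devices loses the library, which is the limitation the roadmap's persistent personas are meant to remove.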

The road

What’s next.

v0.1 · Now

Live

Multi-agent live room, three artifacts post-session (debrief, cinema, postmortem), persona library scoped to localStorage.

v0.2 · Public alpha

Voice cloning + real-photo personas

Once Runway resolves the voice + webcam + tool-calling constraint, custom voices unlock. Real-photo persona uploads arrive with an explicit consent flow.

v0.3 · Paid product

Calendar-aware rooms + iOS

The room knows about the meeting on your calendar tomorrow. iOS app for last-minute prep on the way to the room.

v1.0 · Team

Persistent personas — the Circle

Tier 3 fidelity: the people you regularly walk into rooms with are saved, refined, and improve across sessions. The product compounds.