An agent can read everything ffprobe reports: duration, codec, bitrate, keyframe positions. That is a set of coordinates on a map with no picture of the terrain. It does not know that 0–12s is a title card, that the cut it is about to make at 1:04 lands in the middle of a spoken sentence, or which scene the user actually cares about.
The obvious fix — render every frame and describe it — is correct in principle and unworkable in practice.
Describing each frame through a VLM is hundreds of API calls, minutes of latency, and thousands of near-duplicate descriptions — adjacent frames in a static shot are nearly identical, so most of that spend buys no new information.
Video is temporally redundant by design. LookingGlass exploits the structure the format already has, instead of paying to re-discover it frame by frame.
I-frames carry complete image data. ffprobe enumerates them without decoding. They are the cuts you can make with -c copy, no re-encode.
PySceneDetect compares adjacent-frame histograms to find the moments of real visual change. A 90s demo collapses to 8–12 scenes — each a natural unit of meaning.
Tile one representative frame per scene into a single image. The agent reads a whole clip in one inference call instead of 2,700 frames.
The context-sheet idea is published, validated prior art — IG-VLM (Kim et al., 2024), "An Image Grid Can Be Worth a Video." LookingGlass extends it from a research notebook to a deployable tool: scene-aware sampling instead of uniform partitioning, and a reusable artifact an editing agent can query.
L1 alone may settle an unambiguous cut. L2 gives the full visual picture for reordering. L3 fills gaps when the agent must reason about a specific region. L4 grounds an edit in object geometry — "cut right after the Save click." Running alongside all four, the acoustic track (FIG 17) transcribes any audio and pins it to the timeline, enriching every layer.
Built today: structural pass, scene detection, transcription + alignment, frame extraction, and tiling. With an injected labeler (Studio's claude -p), analyze also writes one audio-visual label per scene; with no labeler it stays inference-free. Proven on our test clip.
Tiered by design: the eager pass writes a one-line label per scene. Full per-scene descriptions are not eager — they're computed lazily and cached on drill-down (L3), so analyze stays fast and you pay for depth only where you ask.
Two ffprobe passes, no video decode. The first reads container and stream metadata; the second enumerates keyframe timestamps. The most agent-useful output is edit_constraints.safe_cut_points — the timestamps you can cut at with -c copy and no quality-degrading re-encode.
// vision.json — structural section { "structural": { "codec": "h264", "width": 1280, "height": 720, "fps": 30.0, "duration_seconds": 12.0, "nb_frames": 360 }, "edit_constraints": { "safe_cut_points": [0.0, 8.333] } }
real output — test_out/vision.json

real artifact · a random ~2-min clip pulled from YouTube · 38 scenes → 3 sheets · this is sheet 1, scenes 01–16. scene-id top-left, timecode bottom-left, burned per cell.
AdaptiveDetector (threshold 15) finds scene boundaries; the midpoint frame of each scene is extracted, scaled with Lanczos, and tiled into one image. Each cell carries its scene ID so the VLM can reference cells by name, and its timecode so the agent can map a description straight to an edit point.
The grid is descended from IG-VLM, but scene-aware rather than uniform: cells land on real visual changes, not arbitrary time slices.
Forcing a whole video onto one image shrinks each cell below the resolution a VLM can read (IG-VLM's "more frames hurts"). So sheet count is derived, not fixed: higher fidelity → fewer cells per sheet → more sheets.
| preset | cell | cells / sheet | use |
|---|---|---|---|
| low | 180 | 196 | bulk scanning |
| medium | 320 | 64 | standard (default) |
| high | 480 | 25 | UI text matters |
| ultra | 720 | 9 | fine detail |
Sheets are sequence-named with their timecode range, so lexical sort matches temporal order and an agent can pick a sheet without opening a file.
live in Studio — the resolution slider + a min-scene-length control regenerate sheets on the fly



real run · the test clip (2:00) → 38 scenes → 3 sheets at medium/320px, min-scene 1.0s
Understanding is tiered: every scene already carries a cheap one-line label from analyze (L2). Layer 3 is the lazy deep tier — and describe always runs on a zoom, not one frame, because a single midpoint frame breaks down two ways: cells get downscaled until fine detail is illegible, and one frame can't show what changes across a multi-second scene. It also gets the scene's transcript (§Acoustic), and the result is cached.
The temporal zoom is Layer 2's tiling applied recursively, on a known window, so it needs no scene detection. Take the scene, pad it by a neighbor each side, uniformly sample N frames across that window, tile them at higher resolution. Fewer cells → more pixels each; several frames → motion recovered; neighbor padding catches mislabeled boundaries.
# agent-callable — prints a path to read lookingglass zoom --input vision.json \ --scene scene-12 --context 1 --frames 9

real zoom · the test clip, scene-23 ±1 neighbor · 9 frames, 00:01:07→00:01:14 · uniform sampling, no detection. f5 is a hard cut to a road shot — exactly the within-scene change the single macro frame misses. Built end to end: vision.zoom + CLI + Studio.
Grounding DINO does open-vocabulary detection from a natural-language query; SAM 2 returns tracked masks. Together they turn "cut just after the user clicks Save" from a guess into a grounded edit.
The cardinal rule: annotations are metadata, not duplicated images. Detections are cached as a .objects.json sidecar beside the clean sheet; the overlay is rasterized on demand only when an annotated view is requested. Storage stays compact and the clean sheet stays canonical.
sidecar holds geometry · overlay drawn on the clean sheet only when asked
Inline a full description and object masks for every scene into vision.json, and loading that file becomes the very token and compute burden LookingGlass exists to remove. You'd have built an expensive way to make video understanding expensive again.
The discipline, then: small by default, deep on demand, never compute the same thing twice.
LookingGlass behaves like a finder over a video — build a map once, then every later request is a search or a drill-down. The data layer is foundational: it is stood up before any layer feature code, because every layer writes through it.
A perceptual hash of each representative frame, clustered by Hamming distance, folds near-duplicate frames into one canonical visual state. A login screen revisited four times is one state, embedded once.
A "what's here / find X" query embeds the text into the same SigLIP 2 space and brute-force cosine-ranks the state vectors in NumPy. Below ~10k vectors a vector database is unnecessary. The only model call is the text embed, and that is cached by normalized query string.
It returns matching scenes, timecodes, the nearest safe cut, a coarse label, and a confidence score.
key = xxh3(source_video) : canonical_state_dhash : op # describe | segment : model_version [ : params ]
Because dedup maps every near-identical frame to one state, a description or mask is computed once per state and reused by every scene that shares it. The source content-hash namespaces every key, so editing the source invalidates transparently — content-addressed, not mtime.
L1 and L2 are cheap and local — FFmpeg, PySceneDetect, Pillow, NumPy. That is the entire current requirements.txt, and it runs anywhere FFmpeg does.
The map's SigLIP 2 embeddings and especially L4 Grounded SAM 2 are different: PyTorch, optional CUDA, ~600 MB of weights, and 30–60s per clip on CPU. Requiring all of that just to tile a sheet would defeat the lightweight core.
So the heavy layers run through a swappable compute backend. The boundary is the detect / segment / embed contract, not the CLI. An agent that never asks for L4 never installs PyTorch; a host with no GPU still reaches L4 remotely.
compute.backend in config · infra detail in PRD §8.5
Everything an agent needs to understand a clip before issuing a command, in one file: the structural profile, safe_cut_points, the sheet manifest with time ranges, and the scene list with labels, timecodes, and the nearest safe cut for each.
Heading toward a tiered shape: a compact index plus addressable detail records, so loading the map never re-pays for the depth.
{
"edit_constraints": { "safe_cut_points": […] },
"context_sheets": [
{ "id":"sheet-001", "time_range":{…},
"scenes_included":["scene-01",…] }
],
"scenes": [
{ "id":"scene-04", "label":"New Test Form",
"start_timecode":"00:00:14.000",
"nearest_safe_cut":12.012 }
],
"summary": { "analysis_cost":{ "estimated_tokens_used":14800 } }
}
# the agent reasons over vision.json "Remove the intro (scene-01), cut straight to the walkthrough (scene-03). Nearest safe cut to scene-03 start is 12.012s — stream-copy, no re-encode." # then emits the command ffmpeg -ss 12.012 -i demo.mp4 -c copy out.mp4
scene IDs and safe cuts turn understanding into a re-encode-free edit
Every operation is a structured command an agent can emit:
Built today: analyze, zoom, and query with --safe-cuts, --scenes, --scene-at. Studio drives all of it through claude -p.
| operation | calls | est. cost |
|---|---|---|
| 1 sheet · medium | 1 | ~$0.015 |
| 3 sheets · high | 3 | ~$0.045 |
| full-frame · 1 scene | 1 | ~$0.015 |
| full-frame · all 18 | 18 | ~$0.27 |
| L4 spatial pass | 0 API | compute only |
For the common case — 1–3 sheets plus a few targeted descriptions — total inference is single-digit pennies per video. The budget is sized to the strictest image limit (Opus); the default model stays Sonnet for cost.
Originally visual-only — but on real footage the dialogue carries half the meaning. When a clip has an audio stream, LookingGlass transcribes it once with faster-whisper (an opt-in extra, lazy-loaded), then aligns each transcript segment to a scene by timecode overlap. Cheap, because the temporal map already exists.
It's not a deeper tier — it's a second input track that enriches the map. Each scene gains a transcript, which feeds the eager labels and the deep describe, so the VLM knows what a scene is about, not only what it looks like. Spoken content becomes searchable too.
[audio] extra, never imported otherwiseWe grabbed a random ~2-minute YouTube clip (Hitched) and ran the whole pipeline on it. The acoustic track transcribed the audio and pinned every line of dialogue to the scene timeline.
The clearest proof: a phone conversation correctly spans the cut between two consecutive shots — the alignment the map makes free. Visual-only labels landed around 70%; adding "what's said" is what closes the gap toward knowing what the clip is actually about.
Parked open questions: audio-aware scene boundaries · waveform sidecars · K-frames "best frame" selection · frames_per_scene > 1 · cross-clip project index.
Cost scales with the unique visual states in a clip and the depth an agent actually requests — not with how long the clip is, and not with how many times the agent asks. That is what lets an agent issue effectively unlimited finder calls against a video without the system getting heavier.