lg@architecture:~$ cat looking_glass.spec ▍

LookingGlass

Temporal-visual context condensation tool for video agents. A compact map of a clip a Vision-Language Model (VLM) can read before it issues a single FFmpeg command.

Subject

video understanding layer

Version

0.1 · MVP

Built

L1 · L2 · L3

Output

vision.json

RUNNING 4-LAYER ARCH FFMPEG-NATIVE CONTENT-ADDRESSED CACHE

FIG 01/20

▸01 /

The Problem

Agents editing video are temporally aware but visually blind.

An agent can read everything ffprobe reports: duration, codec, bitrate, keyframe positions. That is a set of coordinates on a map with no picture of the terrain. It does not know that 0–12s is a title card, that the cut it is about to make at 1:04 lands in the middle of a spoken sentence, or which scene the user actually cares about.

The obvious fix — render every frame and describe it — is correct in principle and unworkable in practice.

the_naive_cost!

2,700 frames in a 90-second
clip at 30fps

Describing each frame through a VLM is hundreds of API calls, minutes of latency, and thousands of near-duplicate descriptions — adjacent frames in a static shot are nearly identical, so most of that spend buys no new information.

~hundreds callsminutes wait$$$ / clip

FIG 02/20

▸02 /

The Wedge

Spend the minimum visual signal for the maximum editing understanding.

Video is temporally redundant by design. LookingGlass exploits the structure the format already has, instead of paying to re-discover it frame by frame.

keyframes

I-frame anchors

I-frames carry complete image data. ffprobe enumerates them without decoding. They are the cuts you can make with -c copy, no re-encode.

scene_boundaries

Units of change

PySceneDetect compares adjacent-frame histograms to find the moments of real visual change. A 90s demo collapses to 8–12 scenes — each a natural unit of meaning.

context_sheet

The insight

Tile one representative frame per scene into a single image. The agent reads a whole clip in one inference call instead of 2,700 frames.

The context-sheet idea is published, validated prior art — IG-VLM (Kim et al., 2024), "An Image Grid Can Be Worth a Video." LookingGlass extends it from a research notebook to a deployable tool: scene-aware sampling instead of uniform partitioning, and a reusable artifact an editing agent can query.

FIG 03/20

▸03 /

Architecture

Four layers of increasing cost, each triggered on demand.

L1 alone may settle an unambiguous cut. L2 gives the full visual picture for reordering. L3 fills gaps when the agent must reason about a specific region. L4 grounds an edit in object geometry — "cut right after the Save click." Running alongside all four, the acoustic track (FIG 17) transcribes any audio and pins it to the timeline, enriching every layer.

FIG 04/20

▸04 /

The Analyze Pipeline

One pass turns a file into a queryable map.

Built today: structural pass, scene detection, transcription + alignment, frame extraction, and tiling. With an injected labeler (Studio's claude -p), analyze also writes one audio-visual label per scene; with no labeler it stays inference-free. Proven on our test clip.

Tiered by design: the eager pass writes a one-line label per scene. Full per-scene descriptions are not eager — they're computed lazily and cached on drill-down (L3), so analyze stays fast and you pay for depth only where you ask.

FIG 05/20

▸05 /

Layer 1 · Structural

BUILT

Free, instant, deterministic. The technical profile every other layer references.

Two ffprobe passes, no video decode. The first reads container and stream metadata; the second enumerates keyframe timestamps. The most agent-useful output is edit_constraints.safe_cut_points — the timestamps you can cut at with -c copy and no quality-degrading re-encode.

codec, resolution, fps, duration from -show_format -show_streams
keyframes via -skip_frame nokey -show_entries frame=pts_time
nearest_safe_cut bridges visual understanding to a re-encode-free edit

file structural.pymodel nonecost $0

// vision.json — structural section
{
  "structural": {
    "codec": "h264", "width": 1280, "height": 720,
    "fps": 30.0, "duration_seconds": 12.0,
    "nb_frames": 360
  },
  "edit_constraints": {
    "safe_cut_points": [0.0, 8.333]
  }
}

real output — test_out/vision.json

FIG 06/20

▸06 /

Layer 2 · Visual Map

BUILT

The core insight: a tiled sheet stands in for the whole clip.

real artifact · a random ~2-min clip pulled from YouTube · 38 scenes → 3 sheets · this is sheet 1, scenes 01–16. scene-id top-left, timecode bottom-left, burned per cell.

AdaptiveDetector (threshold 15) finds scene boundaries; the midpoint frame of each scene is extracted, scaled with Lanczos, and tiled into one image. Each cell carries its scene ID so the VLM can reference cells by name, and its timecode so the agent can map a description straight to an edit point.

scene detection + tiling — built in temporal.py
burned overlays — id + timecode on every cell, so the VLM can cite cells by name
eager labels — one audio-visual label per scene at analyze (injected labeler); deep descriptions stay lazy in L3

The grid is descended from IG-VLM, but scene-aware rather than uniform: cells land on real visual changes, not arbitrary time slices.

FIG 07/20

▸07 /

Layer 2 · Fidelity Knob

BUILT

One knob — per-cell resolution — decides how many sheets you get.

Forcing a whole video onto one image shrinks each cell below the resolution a VLM can read (IG-VLM's "more frames hurts"). So sheet count is derived, not fixed: higher fidelity → fewer cells per sheet → more sheets.

preset	cell	cells / sheet	use
low	180	196	bulk scanning
medium	320	64	standard (default)
high	480	25	UI text matters
ultra	720	9	fine detail

Sheets are sequence-named with their timecode range, so lexical sort matches temporal order and an agent can pick a sheet without opening a file.

live in Studio — the resolution slider + a min-scene-length control regenerate sheets on the fly

real run · the test clip (2:00) → 38 scenes → 3 sheets at medium/320px, min-scene 1.0s

FIG 08/20

▸08 /

Layer 3 · Semantic

BUILT

When one frame isn't enough, zoom into the scene's neighborhood.

Understanding is tiered: every scene already carries a cheap one-line label from analyze (L2). Layer 3 is the lazy deep tier — and describe always runs on a zoom, not one frame, because a single midpoint frame breaks down two ways: cells get downscaled until fine detail is illegible, and one frame can't show what changes across a multi-second scene. It also gets the scene's transcript (§Acoustic), and the result is cached.

The temporal zoom is Layer 2's tiling applied recursively, on a known window, so it needs no scene detection. Take the scene, pad it by a neighbor each side, uniformly sample N frames across that window, tile them at higher resolution. Fewer cells → more pixels each; several frames → motion recovered; neighbor padding catches mislabeled boundaries.

# agent-callable — prints a path to read
lookingglass zoom --input vision.json \
  --scene scene-12 --context 1 --frames 9

autonomous — the agent runs it itself when a scene is too ambiguous, reads the sharper sub-sheet, continues
or manual — a human triggers the same drill-down
cached by scene + window + frames + resolution — repeat zooms are free

real zoom · the test clip, scene-23 ±1 neighbor · 9 frames, 00:01:07→00:01:14 · uniform sampling, no detection. f5 is a hard cut to a road shot — exactly the within-scene change the single macro frame misses. Built end to end: vision.zoom + CLI + Studio.

FIG 09/20

▸09 /

Layer 4 · Spatial

v0.3+ DESIGNED

The deepest layer tells the agent where things are.

Grounding DINO does open-vocabulary detection from a natural-language query; SAM 2 returns tracked masks. Together they turn "cut just after the user clicks Save" from a guess into a grounded edit.

The cardinal rule: annotations are metadata, not duplicated images. Detections are cached as a .objects.json sidecar beside the clean sheet; the overlay is rasterized on demand only when an annotated view is requested. Storage stays compact and the clean sheet stays canonical.

partial-cache hits: a query that's a subset of prior detections reuses them
masks memoized per canonical visual state — computed once, reused everywhere

sidecar holds geometry · overlay drawn on the clean sheet only when asked

FIG 10/20

▸10 /

The Data Problem

If the artifact grows unbounded, it re-creates the cost it was built to remove.

the_risk_being_managed!

Inline a full description and object masks for every scene into vision.json, and loading that file becomes the very token and compute burden LookingGlass exists to remove. You'd have built an expensive way to make video understanding expensive again.

The discipline, then: small by default, deep on demand, never compute the same thing twice.

LookingGlass behaves like a finder over a video — build a map once, then every later request is a search or a drill-down. The data layer is foundational: it is stood up before any layer feature code, because every layer writes through it.

FIG 11/20

▸11 /

The Map · Dedup

Product UI repeats. So N scenes collapse to M canonical states, M ≪ N.

A perceptual hash of each representative frame, clustered by Hamming distance, folds near-duplicate frames into one canonical visual state. A login screen revisited four times is one state, embedded once.

SigLIP 2 (so400m / base), OpenCLIP fallback — Apache/MIT, commercial-safe
not JinaCLIP v2 — CC-BY-NC forbids in-house production use
the map is the bounded index; depth records live in the cache

FIG 12/20

▸12 /

The Finder & The Cache

Search without a vision call. Compute each depth at most once.

Finder

A "what's here / find X" query embeds the text into the same SigLIP 2 space and brute-force cosine-ranks the state vectors in NumPy. Below ~10k vectors a vector database is unnecessary. The only model call is the text embed, and that is cached by normalized query string.

It returns matching scenes, timecodes, the nearest safe cut, a coarse label, and a confidence score.

Cache key

key = xxh3(source_video)
    : canonical_state_dhash
    : op          # describe | segment
    : model_version
    [ : params ]

Because dedup maps every near-identical frame to one state, a description or mask is computed once per state and reused by every scene that shares it. The source content-hash namespaces every key, so editing the source invalidates transparently — content-addressed, not mtime.

store diskcache (SQLite, blob spill, LRU + size cap) hash xxhash / xxh3 (BSD-2) dedup ImageHash dHash

FIG 13/20

▸13 /

The Compute Problem

The heavy layers want a GPU. The core install shouldn't.

L1 and L2 are cheap and local — FFmpeg, PySceneDetect, Pillow, NumPy. That is the entire current requirements.txt, and it runs anywhere FFmpeg does.

The map's SigLIP 2 embeddings and especially L4 Grounded SAM 2 are different: PyTorch, optional CUDA, ~600 MB of weights, and 30–60s per clip on CPU. Requiring all of that just to tile a sheet would defeat the lightweight core.

So the heavy layers run through a swappable compute backend. The boundary is the detect / segment / embed contract, not the CLI. An agent that never asks for L4 never installs PyTorch; a host with no GPU still reaches L4 remotely.

compute.backend in config · infra detail in PRD §8.5

FIG 14/20

▸14 /

The Artifact

vision.json — the context document an agent loads once.

Everything an agent needs to understand a clip before issuing a command, in one file: the structural profile, safe_cut_points, the sheet manifest with time ranges, and the scene list with labels, timecodes, and the nearest safe cut for each.

query reads it for agents — --scene-at, --safe-cuts, --sheet-at
context --agent strips binary fields for clean VLM injection
summary records calls made and tokens used — the cost is legible

Heading toward a tiered shape: a compact index plus addressable detail records, so loading the map never re-pays for the depth.

{
  "edit_constraints": { "safe_cut_points": […] },
  "context_sheets": [
    { "id":"sheet-001", "time_range":{…},
      "scenes_included":["scene-01",…] }
  ],
  "scenes": [
    { "id":"scene-04", "label":"New Test Form",
      "start_timecode":"00:00:14.000",
      "nearest_safe_cut":12.012 }
  ],
  "summary": { "analysis_cost":{ "estimated_tokens_used":14800 } }
}

FIG 15/20

▸15 /

Agent Integration

The map injects into context; the agent reasons, then cuts.

# the agent reasons over vision.json
"Remove the intro (scene-01), cut straight to
 the walkthrough (scene-03). Nearest safe cut
 to scene-03 start is 12.012s — stream-copy,
 no re-encode."

# then emits the command
ffmpeg -ss 12.012 -i demo.mp4 -c copy out.mp4

scene IDs and safe cuts turn understanding into a re-encode-free edit

Every operation is a structured command an agent can emit:

analyze — L1 + L2, write the map built
zoom --scene scene-12 — temporal drill-down built
query --scene-at 00:01:04 — which scene? partial
annotate --queries "save button" — L4 designed

Built today: analyze, zoom, and query with --safe-cuts, --scenes, --scene-at. Studio drives all of it through claude -p.

FIG 16/20

▸16 /

The Economics

Roughly 10× cheaper at the prompt boundary — before any output savings.

operation	calls	est. cost
1 sheet · medium	1	~$0.015
3 sheets · high	3	~$0.045
full-frame · 1 scene	1	~$0.015
full-frame · all 18	18	~$0.27
L4 spatial pass	0 API	compute only

For the common case — 1–3 sheets plus a few targeted descriptions — total inference is single-digit pennies per video. The budget is sized to the strictest image limit (Opus); the default model stays Sonnet for cost.

FIG 17/20

▸17 /

Acoustic Track · Experiment

BUILT

A parallel modality: what's said, aligned to where it's shown.

Originally visual-only — but on real footage the dialogue carries half the meaning. When a clip has an audio stream, LookingGlass transcribes it once with faster-whisper (an opt-in extra, lazy-loaded), then aligns each transcript segment to a scene by timecode overlap. Cheap, because the temporal map already exists.

It's not a deeper tier — it's a second input track that enriches the map. Each scene gains a transcript, which feeds the eager labels and the deep describe, so the VLM knows what a scene is about, not only what it looks like. Spoken content becomes searchable too.

only when present — ffprobe gates it; silent clips skip it cleanly
core stays light — Whisper is an opt-in [audio] extra, never imported otherwise
alignment is free — a timecode overlap, no extra model

experiment · the test clipreal run

We grabbed a random ~2-minute YouTube clip (Hitched) and ran the whole pipeline on it. The acoustic track transcribed the audio and pinned every line of dialogue to the scene timeline.

scenes 38 · 3 sheets dialogue most scenes audio en · whisper base

The clearest proof: a phone conversation correctly spans the cut between two consecutive shots — the alignment the map makes free. Visual-only labels landed around 70%; adding "what's said" is what closes the gap toward knowing what the clip is actually about.

FIG 18/20

▸18 /

Where We Are → Next

The MVP loop runs. The data layer is the next foundation.

Parked open questions: audio-aware scene boundaries · waveform sidecars · K-frames "best frame" selection · frames_per_scene > 1 · cross-clip project index.

▸19 /

The Wedge, Stated Plainly

The deployable, agent-callable, scene-aware version of the image-grid technique.

Cost scales with the unique visual states in a clip and the depth an agent actually requests — not with how long the clip is, and not with how many times the agent asks. That is what lets an agent issue effectively unlimited finder calls against a video without the system getting heavier.

L1 · L2 · L3 + ACOUSTIC RUNNING TIERED LABELS + ZOOM-DESCRIBE, CACHED MAP / FINDER / DEDUP FOUNDATIONAL L4 SPATIAL VIA SWAPPABLE BACKEND