April 14, 2025 · 4 min read
Text Is a Lossy Interface
You imagine an image, translate it into words, and lose something in the process. A founder's case for visual steering in AI image generation.

Vibe: A distinctive feeling or quality that can be sensed — but not easily defined.
I was playing with same.energy, a visual search engine powered by CLIP embeddings, when something clicked. Compared to Pinterest, its results felt different — more semantically fixated. Less stylistic drift, more concept lock-in. Where Pinterest returns fish, boats, and ballerinas for the same query, same.energy fixates on a single type of composition. Always a solitary figure. Take a look:
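The mechanism behind that fixation is easy to reproduce: embed every image once with CLIP, then rank the library by cosine similarity to a query image. Here's a minimal sketch, assuming the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the file paths are illustrative:

```python
# Toy CLIP "vibe search": embed a library of images once, then rank it
# by cosine similarity to a query image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize

library = ["img/a.jpg", "img/b.jpg", "img/c.jpg"]  # illustrative paths
index = embed(library)

query = embed(["img/query.jpg"])
scores = (index @ query.T).squeeze(1)  # cosine similarity (vectors are unit-norm)
for i in scores.argsort(descending=True):
    print(library[i], float(scores[i]))
```

Because everything collapses into one embedding space, the nearest neighbors share a gestalt, not just keywords. That's the concept lock-in you see in the results.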
That difference nagged at me. Not because of what it said about search algorithms, but because of what it revealed about the gap between seeing and saying.
The gap between seeing and saying
I'm a visual thinker. I always have been. But I spent my teenage years learning to operate in a language that wasn't mine. A second language gave me words, but it took away precision. I could describe what I saw, but I couldn't convey what I felt about what I saw.
That experience shaped how I think about creative tools. Language is powerful. It's also lossy.
In visual art — painting, photography, film — artists use color theory, rhythm, composition, balance, form, and texture to create emotional effects. The techniques are concrete. But the experience is always subjective. There's a gap between the artist's intention and the viewer's perception. That ambiguity is kind of the point.
But when you ask someone to control a generative model through text, that ambiguity becomes a problem. You're asking people to verbalize the thing that resists articulation.
The mismatch
Modern text-to-image models have a lossy translation step baked into the process: you imagine an image in your head, translate it into words, and the moment you do, you've already distorted it. You've lost something.
This isn't a prompt engineering problem. It's a fundamental interface problem. The medium of control — text — doesn't match the medium of output — images. You're steering a visual system through a verbal channel.
Even with reference images and tools like ControlNet, the gap between what you intend and what you get remains wide. Not because the models are bad. Because the interface is wrong.
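To make that channel concrete, here's roughly what reference-guided generation looks like today: a minimal sketch using the public diffusers ControlNet API (the model IDs are well-known public checkpoints, the file names are placeholders, and none of this is tied to any particular product). The reference image pins down structure, but mood, tone, and style still have to squeeze through the prompt string.

```python
# ControlNet conditions generation on a structural reference (here, a
# Canny edge map), but every other dimension of intent still travels
# through the text prompt.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

edges = load_image("reference_edges.png")  # composition only: an edge map

image = pipe(
    "moody portrait, solitary figure, soft rim light",  # everything else is words
    image=edges,
    num_inference_steps=30,
).images[0]
image.save("out.png")
```

The conditioning image constrains composition; lighting, emotion, and atmosphere still get flattened into that one prompt string.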
Text is a lossy interface for visual intent.
We built the answer
I first published this essay in April 2025. At the time, it was an open question: can we build interfaces that let people steer image generation visually, without forcing everything through language first?
Then we built aether studio.
The core insight: if text is a lossy interface for steering images, then the interface needs to be visual. Not text-to-image. Image-to-image, guided by a flow.
You compose reference images on an infinite canvas — a model's pose, an outfit, a mood reference, a color palette — and direct the AI through conversation: "make it moodier," "zoom in," "try a different angle." The AI sees what you see. The spatial arrangement carries intent that words can't.
Here's what that looks like in practice. A Spotify playlist cover, composed from references and refined through conversation — from a first attempt to the final version:


Six steps from references to final. Each built on the last. The creator wasn't generating random outputs — they were directing.
And every decision — every branch explored, every direction rejected, every variation refined — is captured as a node in a visual graph. Your creative process has memory. Nothing is lost.
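One way to picture that graph, as a purely hypothetical sketch rather than aether studio's actual data model: each generation is a node that remembers its references, the instruction that produced it, and the node it branched from, so any rejected direction stays one hop away.

```python
# Hypothetical model of a visual generation history: a tree of nodes
# where every refinement or branch keeps a link to its parent.
from dataclasses import dataclass, field

@dataclass
class Node:
    image: str                      # path or URL of the generated image
    instruction: str                # e.g. "make it moodier"
    references: list = field(default_factory=list)  # canvas inputs used
    parent: "Node | None" = None    # where this branch came from
    children: list = field(default_factory=list)

    def refine(self, instruction: str, image: str) -> "Node":
        """Record a new generation that builds on this one."""
        child = Node(image=image, instruction=instruction, parent=self)
        self.children.append(child)
        return child

root = Node("v1.png", "initial compose", references=["pose.png", "palette.png"])
moody = root.refine("make it moodier", "v2.png")
moody.refine("zoom in", "v3.png")               # explored, then set aside
final = moody.refine("try a different angle", "v3b.png")
```

Keeping parents and children explicit is what makes every rejected branch recoverable instead of overwritten.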
You can see this idea pushed to its limit in a 21-image fashion lookbook — one continuous session spanning New York to Seoul, where every prompt built on the last because the tool preserved the context.
Text is lossy. The canvas isn't.