We treated language models like brains in vats, fed only by slips of paper. Text in, text out. They knew about the world only through linguistic descriptions—they had read about the color red but never seen it, read about music but never heard it, read about motion but never watched it. Their knowledge was purely symbolic, disconnected from the sensory experience that grounds human understanding.
But intelligence requires perception. With Gemini and similar multimodal models, we're finally giving the brain eyes and ears. The awakening has begun.
The Grounding Problem
Philosophers have long worried about the "symbol grounding problem"—how can symbols acquire meaning if they're never connected to the real world? A text-only model knows that "dog" is associated with "bark," "tail," "pet," and thousands of other word patterns. But does it know what a dog is? Does it understand "dog" the way someone who has seen, touched, and heard dogs understands?
The concern was that language models were building elaborate symbolic structures with no foundation—castles of words floating in abstraction. They could manipulate language brilliantly while fundamentally lacking understanding.
Multimodal models change this equation. When a model is trained on images, videos, and audio alongside text, the symbols become connected to sensory patterns. "Dog" is no longer just a word-node in a semantic network; it's linked to visual patterns of furry bodies, to audio patterns of barks and pants, to motion patterns of running and jumping. The symbol is grounded.
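One way to picture this grounding is a shared embedding space, where a caption and an image of the same thing land near each other. The sketch below uses the openly available CLIP model as a stand-in; Gemini's internal architecture is not public, so treat the checkpoint name, image path, and captions as illustrative assumptions rather than a description of any particular system.

```python
# Toy illustration of grounding: in a CLIP-style joint embedding space,
# a photo of a dog scores highest against the caption that names it.
# The checkpoint, image path, and captions are assumptions for this sketch.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # any local photo of a dog
captions = ["a photo of a dog", "a photo of a cat", "a page of sheet music"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, turned into probabilities over the captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```

The word "dog" is no longer defined only by its neighbors in text; it sits next to pictures of dogs in the same space.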
The Perceptual Capabilities
The multimodal capability emerging in models like Gemini goes beyond just recognizing what's in an image. These models can understand video—tracking objects, interpreting actions, following narratives across time. They can process audio—understanding speech, music, and environmental sounds. They can read code—not merely as strings of characters but as functional structure.
This combination—understanding video, audio, code, and text simultaneously—is a step toward grounded intelligence. A machine that can watch a YouTube video of an engine repair and then walk you through the fix is fundamentally different from one that has only read the manual. The video shows what the parts actually look like, how the mechanic's hands move, what sounds indicate problems. This is how humans learn complex physical tasks.
The capability extends to reasoning across modalities. The model can be shown a diagram and asked to explain it. It can hear a conversation and summarize the key points. It can watch a procedure and write instructions. The modalities become complementary views of the same underlying information.
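As a concrete illustration of this kind of cross-modal prompting, here is a minimal sketch using the google-generativeai Python SDK: an image of a diagram is passed alongside a text question in a single request. The model name, file path, and prompt are placeholders.

```python
# Minimal sketch: ask a multimodal model to explain a diagram.
# Assumes the google-generativeai SDK and a valid API key; the model name,
# file path, and prompt below are placeholders for illustration.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

diagram = Image.open("circuit_diagram.png")  # any local image of a diagram
response = model.generate_content(
    [diagram, "Explain what this diagram shows, step by step."]
)
print(response.text)
```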
From Symbols to World
We are moving from "symbol manipulation" to "world understanding." The map is now drawn from the territory itself. Where text-only models knew about the world through descriptions, multimodal models know the world through perception. The knowledge is more direct, more grounded, more robust.
This has practical implications. Image understanding enables applications that pure language models couldn't address—analyzing photos, interpreting charts, processing documents with figures. Video understanding enables analysis of recordings, surveillance footage, and instructional content. Audio understanding enables transcription, sound recognition, and voice analysis.
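Audio transcription with the same SDK, for instance, is roughly a two-step affair: upload the recording, then prompt against it. This sketch assumes a recent SDK version with File API support; the file name and prompt are placeholders.

```python
# Sketch of audio transcription via the Gemini File API.
# Assumes google-generativeai with File API support and a configured key;
# "meeting.mp3" is a placeholder path for any local recording.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

audio = genai.upload_file(path="meeting.mp3")  # upload the recording first
response = model.generate_content(
    [audio, "Transcribe this audio and summarize the key points."]
)
print(response.text)
```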
But the deeper implication is for the nature of AI understanding itself. If meaning comes from grounding in sensory experience, then multimodal models might genuinely understand in ways that text-only models cannot. The philosophical debate continues, but the empirical capabilities are advancing.
Sensors as Senses
Sensors are the eyes and ears of the machine. A text-only model is blind and deaf; its world is made entirely of words. A multimodal model can see, hear, and—through integration with robotics—eventually touch. Each modality adds a dimension of understanding, a way of knowing that text alone cannot provide.
The awakening is the realization that intelligence isn't purely abstract. It's grounded in the world, shaped by perception, connected to physical reality. The models are opening their eyes. What they'll see, and what they'll understand because they can see—this is the frontier we're exploring.