Originally published in Chinese on HK01 on 2026-02-09 07:00 | By Michael C.S. So | AiX Society

When we talk about the next turning point for artificial intelligence, a growing number of scientists, entrepreneurs, and developers are focusing on a seemingly niche yet critically important field — Spatial 3D AI (three-dimensional spatial artificial intelligence). This is not merely about enabling machines to understand the geometric structure of three-dimensional space. It is about giving AI the ability to truly comprehend the physical world, equipped with human-like spatial awareness, bodily perception, and commonsense reasoning. This capability is the foundational infrastructure on the path toward Artificial General Intelligence (AGI).

What Is Spatial 3D AI?

Spatial 3D AI refers to AI that can “see” and understand the physical world. It goes beyond recognizing objects from 2D images — it must be able to construct three-dimensional scenes, infer the depth and relative positions of objects, anticipate potential physical interactions, and even predict the outcomes of actions. For example, a spatially intelligent AI that sees a glass of water teetering on the edge of a table would predict that if someone bumps the table, the glass will fall and spill.
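The glass-on-the-table example can be caricatured in a few lines of code. This is a toy spatial-commonsense rule, purely illustrative: every name and threshold here is an assumption for the sketch, not a real physics engine or any published system's API.

```python
def predict_outcome(scene):
    """Toy spatial-commonsense check: does a bump knock the glass off?
    Illustrative only -- a real system would reason over learned 3D
    geometry and dynamics, not a hand-written rule."""
    glass = scene["glass"]
    # If the glass extends past the table edge, a bump sends it over.
    overhang = glass["x"] + glass["radius"] - scene["table_edge_x"]
    if overhang > 0 and scene.get("table_bumped"):
        return "glass falls and spills"
    return "glass stays put"

# A glass teetering right at the edge of a table one metre wide
scene = {
    "glass": {"x": 0.98, "radius": 0.05},  # centre and radius, in metres
    "table_edge_x": 1.0,
    "table_bumped": True,
}
print(predict_outcome(scene))  # → glass falls and spills
```

The point of the caricature is what the hand-written rule replaces: a spatially intelligent AI would derive the same prediction from perceived 3D geometry and physical dynamics rather than from a rule someone coded in advance.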

This is not simply a sensing technology problem. It requires AI to learn how to integrate visual, auditory, tactile, and other sensory data to build an internal “World Model” that simulates possibilities within an environment. This capability is innate in the human mind, but for AI, it represents an entirely new challenge.

Why Is Spatial Intelligence a Prerequisite for AGI?

The reason humans possess “common sense” is that from early childhood, we build our capacity for causal reasoning and spatial memory through interaction with the three-dimensional world. This is precisely the viewpoint that Professor Fei-Fei Li has been championing at Stanford in recent years. In a 2025 essay for TIME, she wrote: “AI, if it lacks an understanding of space, cannot truly possess common sense and reasoning ability. Spatial intelligence is the scaffolding of human cognition.”

She went further, pointing out that while large language models can tell brilliant stories, write code, and answer complex questions, they are like “blind storytellers” with no knowledge of the physical world. She wrote: “LLMs are eloquent but inexperienced, knowledgeable but ungrounded. They talk about the world but don’t truly know the world.”

Therefore, if AI cannot grasp three-dimensional spatial geometry, the causal relationships between objects, and the reasoning between actions and their consequences, it will forever remain at the level of symbols and language — unable to enter the realm of true “intelligence.”

Omniverse and Digital Twins: AI’s Virtual Training Ground

One of the most important spatial training platforms for AI today is NVIDIA’s Omniverse. It is a physically accurate 3D digital twin simulation platform that allows enterprises to design, deploy, and optimize real-world systems and processes within a virtual environment.

NVIDIA CEO Jensen Huang has stated: “Everything that moves will be robotic and embodied by AI. Omniverse will be the operating system of physical AI.” This vision is already being realized at companies like BMW and Amazon. BMW, for instance, recreated its automobile factory within Omniverse to simulate production line modifications, successfully achieving a 30% efficiency improvement. Amazon simulates 500,000 warehouse robots for scenario rehearsals and layout optimization, saving enormous physical testing costs.

These virtual worlds generate vast amounts of synthetic data for AI training, effectively addressing the difficulty and expense of acquiring real-world data.
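The synthetic-data idea above can be sketched as a domain-randomization loop: randomize a virtual scene, render it, and harvest the ground-truth labels that the simulator provides for free. This is a minimal stand-in for the pattern, not Omniverse's actual API; all function names and scene parameters here are invented for illustration.

```python
import random

def randomize_scene():
    """Sample a randomized virtual scene (domain randomization).
    Parameters are illustrative, not any simulator's real schema."""
    return {
        "object": random.choice(["box", "pallet", "bin"]),
        "position": (random.uniform(-1, 1), random.uniform(-1, 1), 0.0),
        "lighting": random.uniform(0.2, 1.0),
    }

def render_and_label(scene):
    """Stand-in for a renderer: a real pipeline would return an image
    plus ground-truth annotations taken directly from the simulator."""
    image = f"render({scene['object']} @ {scene['position']})"
    label = scene["object"]  # ground truth is free in simulation
    return image, label

# Generate a small labeled synthetic dataset for training
dataset = [render_and_label(randomize_scene()) for _ in range(1000)]
```

Because labels come straight from the simulator's own scene description, there is no manual annotation cost, which is exactly why digital twins are attractive where real-world data is scarce or expensive.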

Embodied AI: Intelligence Is Not Just About Having a Brain

Spatial intelligence cannot rely solely on the brain (the model) — it also needs a body (embodiment). Professor Fei-Fei Li and other neuroscientists emphasize that intelligence is the product of sensory perception, motor action, and environmental interaction. If AI cannot interact with the world through cameras, depth sensors, robotic arms, and other hardware, it will be unable to develop true “common sense.”

She noted: “It is much more likely that AI systems will develop human-like cognition if they are built with architectures that learn and improve in similar ways as the human brain, using connections to the real world.”

Platforms such as Meta’s AI Habitat and AI2’s THOR are built precisely for this purpose — providing simulated spaces where AI agents can undergo “simulated experiences” and acquire real-world operational capabilities through reinforcement learning.
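The "simulated experience" loop these platforms provide follows a standard agent-environment interface: the agent acts, the environment returns a new observation and a reward, and learning happens over many such episodes. The sketch below uses a toy grid world with a gym-style `reset`/`step` interface as a stand-in; it is an assumption-laden caricature, not the actual Habitat or THOR API.

```python
import random

class GridWorld:
    """A toy simulated environment with a gym-style reset/step
    interface, standing in for platforms like Habitat or THOR."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: +1 (move forward) or -1 (move back); goal is the last cell
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1
        reward = 1.0 if done else -0.01  # small per-step cost
        return self.pos, reward, done

# One episode of "simulated experience" with a random policy;
# a reinforcement learner would improve the policy from these rewards.
env = GridWorld()
state = env.reset()
total_reward = 0.0
for _ in range(50):
    action = random.choice([1, -1])
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

Real embodied simulators replace the integer state with photorealistic observations (RGB, depth, audio) and the two actions with navigation and manipulation commands, but the training loop has this same shape.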

World Models: AI’s Inner Universe

For AI to truly understand physical environments and future scenarios, it needs a built-in “World Model.” Fei-Fei Li defines such a model as having three key characteristics:

  • Generative: Capable of producing 3D worlds that are semantically and physically coherent, and able to simulate the progression of events within them.
  • Multimodal: Able to integrate language, images, sound, depth sensing, and other diverse sensory inputs.
  • Interactive: Given an action as input, able to compute the resulting updates to the state of the environment.
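The three properties in the list above can be read as an interface contract. The sketch below shows one minimal way to express it; the class name, method names, and internals are all assumptions for illustration, not the API of Marble or any published world model.

```python
class WorldModel:
    """Illustrative interface for the three properties: generative,
    multimodal, and interactive. Hypothetical, not a real system."""

    def generate(self, prompt):
        """Generative: produce a coherent 3D scene from a description."""
        return {"prompt": prompt, "objects": ["table", "glass"], "t": 0}

    def observe(self, scene, text=None, image=None, audio=None, depth=None):
        """Multimodal: fold heterogeneous sensory inputs into the scene state."""
        scene = dict(scene)
        scene["observations"] = [m for m in (text, image, audio, depth) if m is not None]
        return scene

    def step(self, scene, action):
        """Interactive: given an action, compute the updated world state."""
        scene = dict(scene)
        scene["t"] += 1
        scene["last_action"] = action
        return scene

# Generate a scene, feed it an observation, then advance it by one action
model = WorldModel()
scene = model.generate("a glass at the edge of a table")
scene = model.observe(scene, text="someone approaches the table")
scene = model.step(scene, "bump table")
```

In a real world model each of these methods would be a learned neural component; the sketch only fixes the shape of the contract that the three bullet points describe.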

Her startup, World Labs, has released a prototype system called Marble that can generate interactive, navigable virtual 3D worlds from simple text prompts, providing AI with richer “mental simulation sandboxes” for training.

The Next Stop for General Intelligence: From “Words” to “Worlds”

The past decade of AI development focused on language and image recognition. But in the next decade, if AI is to truly enter the human world and assist with work and daily life, it must develop the ability to understand space. Fei-Fei Li wrote: “To move toward AGI, we must move from words to worlds.”

This kind of AI would no longer be just an assistant that responds to questions. It would be capable of entering real spaces — homes, factories, hospitals — understanding human language and translating it into real-world action decisions, equipped with the ability to perceive, remember, predict, and adapt.

Spatial Intelligence: Opening the Gateway to AGI

For artificial intelligence to be truly “general,” it cannot remain at the level of data and language processing. It must be able to perceive the real world, understand environmental structures, predict physical events, and share spaces and tasks with humans. Spatial 3D AI is precisely what opens that essential door on the road to AGI.

As Fei-Fei Li said: “Without spatial understanding, AI is blind to the real world. With it, we begin to see the potential for machines to reason, imagine, and collaborate as we do.”

Spatial intelligence is not an add-on feature for AI — it is the fundamental capability that will carry it into an era of true intelligence and coexistence. In the future, we do not just expect AI to speak eloquently; we expect it to “enter our world” and co-create new value alongside us.
