© 2025-2026 Dariusz Korzun Licensed under CC BY-NC 4.0
Last updated February 8, 2026


Intelligence and Consciousness

The Operational Reality

Practitioners are relentlessly bombarded with the most distracting question in the history of computer science: "Is the machine conscious?" Philosophers call this the "Hard Problem" of consciousness—the question of subjective experience, or qualia. Is there a "someone" home? Is the light on inside?

For a philosopher, this question is a career. For someone trying to understand and work with AI? It's a trap. A seductive, intellectually fascinating trap—but a trap nonetheless.

There's no scientific test for consciousness. The current consensus across neuroscience and philosophy is stark—science simply doesn't know how to detect subjective experience, in machines or otherwise. A landmark Nature adversarial collaboration testing Integrated Information Theory (IIT) against Global Neuronal Workspace Theory (GNWT)—which began in 2018 and published initial results in 2023, with follow-up analyses continuing through 2025—found that neither leading theory of consciousness was decisively supported. The basic mechanisms of consciousness in biological brains remain unresolved, let alone in artificial systems.

The practical implication: you must ruthlessly distinguish ability from awareness. Whether the system "feels" pain matters less than whether it can diagnose the cause of pain in a patient with 99.9% accuracy.

This isn't a new form of life. It's a powerful, fragile tool.

Turing's Move: From Thinking to Imitation

Alan Turing confronted this same tension decades before computing power existed to test it. In 1950, the question "Can machines think?" was alluring—irresistible, really—but Turing, by then a thirty-seven-year-old mathematician at the University of Manchester, recognized something important: it wasn't answerable in any rigorous sense. The question was too vague, too entangled with definitions that shifted depending on who was asking.

So he replaced the question with an experiment. He called it the imitation game.

The setup is beautifully straightforward. A human judge communicates through text with two unseen participants—one human, one machine. If, over sustained interaction, the judge can't reliably distinguish which is which, the machine passes. With that single reformulation, Turing shifted the field from metaphysical debate to operational testing. Intelligence, in this framing, became defined by success in imitation. The test is behavioral to its core.
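The protocol can be sketched as a tiny simulation. Everything here is invented for illustration (the canned participants, the coin-flip judge); the point is only that the test is defined entirely over transcripts, nothing else.

```python
import random

def imitation_game(judge, human, machine):
    """One trial of the imitation game: the judge sees two anonymized
    replies and guesses which one came from the machine."""
    pair = [("human", human), ("machine", machine)]
    random.shuffle(pair)  # the judge must not know which channel is which
    replies = [responder("How was your day?") for _, responder in pair]
    guess = judge(replies)  # index of the suspected machine
    return pair[guess][0] != "machine"  # True: the machine escaped detection

# Toy participants (hypothetical): a perfect mimic forces chance-level judging.
human = lambda q: "Fine, a bit tired."
machine = lambda q: "Fine, a bit tired."
judge = lambda replies: random.randrange(2)  # indistinguishable, so guess

trials = 10_000
pass_rate = sum(imitation_game(judge, human, machine) for _ in range(trials)) / trials
print(pass_rate)  # hovers around 0.5, i.e. chance
```

A pass rate near 50% is exactly what "can't reliably distinguish" means operationally; rates well below 50% mean the judge is catching the machine.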

Modern large language models now demonstrably pass this test under certain conditions. In 2024, researchers from UC San Diego (Cameron Jones and Benjamin Bergen) provided the first rigorous empirical evidence that AI systems can pass the Turing Test—GPT-4 was judged to be human 54% of the time in controlled experiments, while actual human participants were identified as human only 67% of the time. GPT-4 outperformed the classic ELIZA program (22%) by a wide margin.

Interpretation remains contested: while some researchers conclude this constitutes passing the Turing Test, others note that the 54% vs 67% gap suggests judges could still distinguish AI from humans at rates above chance. It's a genuine scientific debate.
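Why that debate hinges on sample size can be shown with an exact binomial test. Only the 54% rate comes from the study; the trial counts below are hypothetical.

```python
from math import comb

def binom_two_sided_p(k, n):
    """Exact two-sided binomial test against chance (p = 0.5): the total
    probability of outcomes no more likely than the observed one. Uses
    integer arithmetic so large n doesn't underflow."""
    observed = comb(n, k)
    tail = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return tail / 2**n

# The same 54% "judged human" rate under two hypothetical trial counts:
for n in (100, 2000):
    k = round(0.54 * n)
    print(n, binom_two_sided_p(k, n))
# With ~100 trials, 54% is statistically indistinguishable from a coin flip;
# with thousands, the same rate is clearly different from chance.
```

The lesson: a headline percentage alone settles nothing; the study's scale determines whether "54%" means "passed" or "noise."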

Under light conversational probing, these models appear fluent, witty, even self-reflective. They are, in effect, stochastic parrots with a PhD in everything—astonishingly capable mimics that have absorbed the patterns of human discourse without grasping its substance. They're very good at sounding like they understand. That's precisely what makes them dangerous to over-trust.

But this success reveals the test's fundamental flaw rather than the machine's fundamental intelligence. Passing a Turing Test is a statement about surface behavior in a narrow channel, not about the inner life of a system. It proves the machine can simulate humanity, not that it possesses it.

Imitation is not identity.

The Three Limits of Behavioral Testing

The Turing Test has three structural weaknesses that shape how you should evaluate any AI system.

  • Anthropocentrism. The test treats human conversational behavior as the gold standard for intelligence, as if all useful cognition were shaped like a chat between two people. This biases both expectations and designs toward human-like traits. But many of the most valuable AI systems will never look or feel human—and that's a strength, not a defect.

  • Behaviorism. The test cares solely about what text emerges from the channel, never about how that output was produced. A system can pass every conversational probe while possessing no understanding whatsoever. Performance becomes a mask that hides the absence of meaning.

  • Narrowness. A text-only dialogue ignores embodiment, perception, and action in the physical world. The test can't determine whether a system can see, move, or intervene safely in a complex environment. For AI deployed in logistics, healthcare, or industrial settings, that blind spot is operationally unacceptable. Language is one modality among many.

Beyond the Turing Test: Modern Evaluation Frameworks

The limitations of Turing's original test have driven researchers to develop more sophisticated evaluation approaches. Even if your primary concern is practical application, awareness of these frameworks matters.

Turing Test 2.0 Frameworks (2024-2026): A family of next-generation evaluation methods has emerged. They assess AI through creativity, resource efficiency, and multi-modal performance rather than mere conversational imitation.

The Lovelace Test 2.0 (Riedl, 2014): This test focuses on creative capability. The question it asks: can the AI produce artifacts that its creators can't fully explain? This targets genuine novelty rather than pattern-matching sophistication. It reaches for something deeper than the Turing Test ever could.

Integrative Turing Tests: These are systematic probes across object detection, captioning, attention prediction, word associations, and free-form conversation. They use both human and AI judges to evaluate indistinguishability. It's a much more comprehensive approach than anything Turing could have imagined.

Capability-Based Benchmarks: This is where modern AI evaluation has largely moved. The field has shifted from "can it fool a human?" to "can it reliably accomplish specific tasks?" Benchmarks like MMLU, HumanEval, SWE-bench, and domain-specific evaluations provide measurable capability metrics. This is far more useful for practical purposes. You can actually base decisions on these numbers.
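The shift from imitation to capability can be sketched as a minimal eval harness: score a system by checked task outcomes, not by human-likeness. The toy model and tasks below are invented for illustration, not drawn from any real benchmark.

```python
def evaluate(model, tasks):
    """Pass rate over (prompt, checker) pairs: the benchmark pattern
    behind suites like HumanEval, reduced to its skeleton."""
    passed = sum(1 for prompt, check in tasks if check(model(prompt)))
    return passed / len(tasks)

# Hypothetical "model" that only handles arithmetic prompts.
def toy_model(prompt):
    try:
        return str(eval(prompt))  # fine for this toy; never eval untrusted input
    except Exception:
        return "I don't know."

tasks = [
    ("2 + 2", lambda out: out == "4"),
    ("3 * 7", lambda out: out == "21"),
    ("capital of France?", lambda out: "Paris" in out),
]
print(evaluate(toy_model, tasks))  # 2 of 3 checks pass
```

Note what the harness never asks: whether the output sounded human. That's the entire methodological shift in one function.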

The Turing Test was a conceptual breakthrough—genuinely brilliant for its time. But it shouldn't serve as an evaluation framework. Use task-specific benchmarks, reliability under distribution shift, and failure mode analysis to determine whether a system performs reliably at specific operations within defined constraints.

The Chinese Room: Syntax Without Semantics

The most incisive critique of behavioral tests comes from philosopher John Searle and his 1980 thought experiment: the Chinese Room.

Picture a person locked in a sealed room. They don't speak any Chinese—not a word. Through a slot in the door, Chinese symbols are passed to them. They possess an enormous rulebook that instructs: "When you receive this symbol, respond with that symbol." That's all they have—rules for matching patterns.

The person receives symbols, processes them via the rules, and pushes symbols back out through the slot. To an outside observer, the room is perfectly fluent in Chinese conversation. Completely indistinguishable from a native speaker. But to the person inside? Meaningless shapes shuffled according to instructions. No understanding whatsoever. Just pattern-matching.

Searle's argument is precise: symbol manipulation isn't understanding. The person in the room has syntax—the rules for arranging symbols—but zero semantics. There's no meaning attached to any symbol. The appearance of fluency is manufactured entirely from pattern-following. This is exactly what large language models do, just at a much larger scale and much faster speed.
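The room's mechanics reduce to a lookup table. The tiny rulebook below is invented; the point is that the code producing fluent-looking replies contains no representation of meaning at all.

```python
# A toy Chinese Room: match incoming shapes, emit the shapes the rule names.
RULEBOOK = {
    "你好吗？": "我很好，谢谢。",       # the operator sees only shapes, not meanings
    "你叫什么名字？": "我没有名字。",
}

def operate(symbols: str) -> str:
    """Pure syntax: pattern in, pattern out, zero semantics."""
    return RULEBOOK.get(symbols, "请再说一遍。")  # fallback rule: "please say it again"

print(operate("你好吗？"))  # fluent to an outside observer; opaque to the operator
```

An LLM's mapping is learned, probabilistic, and vastly larger, but Searle's question is the same: where in the mechanism would the meaning live?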

There's a standard counterargument worth acknowledging. Perhaps the system as a whole—the person, the rulebook, and the room together—possesses understanding, even if the person alone doesn't. On this view, meaning emerges from the composite behavior rather than residing in any single component. It's called the "systems reply," and philosophers have been debating it for decades.

But the metaphysical dispute is less important than the operational lesson. Whatever philosophers conclude, systems that impress with fluent outputs may still lack grounded understanding. Fluency feels like comprehension. It sounds like comprehension. But it may not be comprehension.

The Symbol Grounding Problem

The Chinese Room points to a deeper issue that has haunted artificial intelligence since its inception: the symbol grounding problem, articulated formally by Stevan Harnad in 1990.

How does the word "apple" inside a computer connect to the physical reality of a crisp, red fruit you can hold in your hand? How do symbols acquire meaning? This sounds like philosophy, but it has profound practical implications.

In biological intelligence, symbols are grounded through sensory experience and interaction with the physical world. The word "apple" connects to accumulated encounters with taste, texture, weight, and visual appearance across a lifetime. Thousands of apples. The map gets tied to its territory through embodied experience. Holding apples. Eating them. Knowing what that crunch sounds like, what that juice tastes like. That's grounding.

Current AI systems lack this grounding. They manipulate symbols with extraordinary sophistication, but they have no pathway to connect those symbols to physical reality. When a model produces the token "apple," nothing ties that string to the fruit you can hold, smell, and bite. The system's "world" is a cloud of statistical correlations—patterns of co-occurrence across vast corpora of text, images, and audio. The model doesn't know what an "apple" tastes like; it only knows that the word "apple" often appears near "red" and "fruit" in its training data. That's a very different thing.
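That "cloud of statistical correlations" can be made concrete with a toy co-occurrence count. The three-sentence corpus is invented; nothing in it ties "apple" to taste or weight, only to neighboring words.

```python
from collections import Counter
from itertools import combinations

corpus = [
    "the red apple is a crisp fruit",
    "an apple is a sweet red fruit",
    "the sky is blue today",
]

# Count how often word pairs share a sentence: all the model-side "meaning" there is.
cooc = Counter()
for sentence in corpus:
    for a, b in combinations(sorted(set(sentence.split())), 2):
        cooc[(a, b)] += 1

print(cooc[("apple", "red")])   # "apple" lives near "red"...
print(cooc[("apple", "blue")])  # ...and nowhere near "blue"
```

Real models replace raw counts with learned embeddings over trillions of tokens, but the raw material is the same: which symbols appear near which other symbols.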

The Multimodal Development (2024-2026): Recent research has complicated the picture in ways nobody fully anticipated. Vision-language models (VLMs) trained on paired image-text data show preliminary evidence that grounding may emerge even without explicit grounding objectives. Nobody programmed this in. It just appeared.

Studies in 2024-2025 (Cao et al., Bousselham et al., Schnaus et al.) demonstrate that attention heads in these models retrieve environmental context when predicting linguistic tokens—a behavior consistent with implicit grounding. Some researchers argue that by learning rich statistical patterns over human language, LLMs may acquire genuine semantic content not through direct world interaction, but by capturing the relational structure humans have encoded in text.

However, this remains deeply contested. A 2025 Frontiers in Systems Neuroscience review concluded that LLMs remain subject to the symbol grounding problem because "the meanings of the words they generate are not grounded in the world."

The practical implication: multimodal models may exhibit functional grounding—behavior that resembles understanding in specific contexts—while still lacking the robust, generalizable world connection that would prevent systematic errors.

Don't confuse emergent correlations with genuine comprehension. Symbol grounding remains an open problem, and that open status has direct consequences for everything you deploy. It explains why models can be both astonishingly fluent and glaringly nonsensical about basic physical situations. They're optimizing consistency within their symbol space, not fidelity to an independent world.

Moravec's Paradox: The Inversion of Difficulty

Another major limitation manifests in counterintuitive ways. It's called Moravec's Paradox, named after roboticist Hans Moravec, and it inverts intuitions about what should be computationally hard.

High-level reasoning—chess, theorem proving, medical diagnosis, poetry—was assumed to be the hardest thing for computers to do. That assumption was wrong. Spectacularly wrong. Those tasks are computationally tractable. They follow rules, logic, and patterns that can be encoded and optimized.

What's actually hard? The things a two-year-old does effortlessly. Perceiving a cluttered room. Walking on uneven ground. Folding laundry. Knowing that if you drop a glass, it'll shatter—and that an elephant won't fit in a backpack. This gap between abstract processing and physical common sense is enormous.

The paradox emerges from evolutionary history. Human brains have had millions of years to refine perception and motor control, baking those skills into deep, efficient neural circuits. Abstract reasoning, by contrast, is a recent evolutionary add-on; humans struggle with algebra precisely because their hardware wasn't primarily optimized for it. Digital computers invert this profile entirely. They excel at structured symbol manipulation and struggle with the messy, high-dimensional signals of the physical world.

Intelligence isn't a single dial you can turn up—it's a vector of highly uneven capabilities. Intuitions about difficulty were built for brains, not for silicon. What's hard for humans is easy for machines; what's easy for humans is hard for machines. It's almost perfectly inverted.

The Modern Inversion (2025): A new version of Moravec's Paradox is unfolding in real-time. AI systems now solve International Mathematical Olympiad problems, write production-quality code, and pass professional exams—the kind of stuff that would've seemed like science fiction five years ago. Yet they still struggle to reliably use a computer mouse, navigate a cluttered desktop, or perform basic GUI interactions. The abstract has become easy; the embodied remains hard.

Humanoid robotics is making rapid progress against this constraint. The leading players are racing to prove both functional dexterity and scalable production. Foundation models for robotics are maturing faster than most expected. Yet a significant gap remains between ambitious demonstrations and reliable real-world deployment.

Deployment data from 2025 reveals the reality. Global humanoid robot shipments reached 13,317 units in 2025, a 480% increase year-over-year. Chinese manufacturers dominated volume: AgiBot alone shipped 5,168 units, while U.S. firms (Tesla, Figure AI, and Agility Robotics) each shipped approximately 150 units. Production scaling and commercial deployment have diverged from capability demonstrations: Chinese vendors have reached thousand-unit manufacturing scale while Western firms focus on advancing technical capabilities.
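As a sanity check, the stated growth rate implies the prior year's baseline. Nothing here is new data, just back-of-envelope arithmetic on the figures above.

```python
shipments_2025 = 13_317
yoy_increase = 4.80  # the stated 480% year-over-year increase

# A 480% increase means 2025 = baseline * (1 + 4.80).
implied_prior_year = shipments_2025 / (1 + yoy_increase)
print(round(implied_prior_year))  # roughly 2,300 units the year before
```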

Moravec's insight functions as a warning. Don't assume that because a system can draft a contract, summarize a legal brief, or debug a program, it'll also handle vision, robotics, or common-sense interaction with the environment. These aren't neighboring capabilities. They're not even in the same building.

They occupy entirely different technical regimes. A brilliant chess player doesn't automatically know how to ride a bicycle. The same principle applies to AI. Though the gap is narrowing faster than many predicted.

The Strategic Reality

This isn't an emergent consciousness. It's a powerful and complex tool.

A tool requires maintenance, boundary conditions, and realistic expectations. No one gets angry at a hammer when it can't solve a calculus problem. But an autonomous agent? When you treat a tool as a person, you inherit failure modes no one designed for.

Clear thinking about intelligence is the first and most important safety system.

Success depends on understanding the difference between a simulation of thought and the act of thinking. That distinction guides every decision.