
Jan 22, 2026

Neuroscience, Psychology, and the Black Box of AI

Ananya Ganesh

“Never trust anything that can think for itself if you can't see where it keeps its brain.” This quote from Harry Potter perfectly captures the modern anxiety surrounding artificial intelligence: while algorithms are now ubiquitous in our daily lives, their internal decision-making processes remain a black box, opaque to their users and even to the scientists who build them. This opacity creates a dangerous chasm between the speed of AI progress and our ability to ensure its safety. As we strive toward Artificial General Intelligence (AGI)—systems capable of matching or surpassing human cognition—we are confronted by the reality that current models, despite their power, remain prone to careless errors. Consequently, the pursuit of superior intelligence must be matched by a pursuit of clarity: we cannot meaningfully harness or control these systems until we achieve a comprehensive understanding of how they actually think.

It seems fairly intuitive to conclude that AI and neural networks function in ways closely analogous to the human brain and its neuronal networks; neural networks, in fact, are directly inspired by the architecture of the human brain and the links between neurons that allow for the rapid processing of stimuli and generation of responses. The process by which artificial intelligence and machine learning models are trained bears a striking resemblance to the psychological process of enculturation described by Albert Bandura’s social cognitive theory, in which learning occurs unconsciously through repeated observation and incentivization. However, modern cognitive science has not yet produced a convincing model of human cognition. While frameworks like Dual Process Theory offer good qualitative approximations, they fail to mathematically explain how biological signals become complex thoughts.
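
To make the brain analogy concrete, a single unit of an artificial neural network can be written in a few lines. The sketch below is purely illustrative, with made-up inputs and weights rather than values from any real model: the unit sums its weighted inputs and “fires” only above a threshold, loosely echoing how a biological neuron integrates synaptic input before spiking.

    # A minimal, purely illustrative artificial "neuron" (weights and inputs are
    # invented for this sketch, not taken from any real model). It sums its
    # weighted inputs and fires only above a threshold, loosely echoing how a
    # biological neuron integrates synaptic input before spiking.
    def artificial_neuron(inputs, weights, bias):
        # Weighted sum of incoming signals, analogous to synaptic integration.
        activation = sum(x * w for x, w in zip(inputs, weights)) + bias
        # ReLU nonlinearity: silent below threshold, active above it.
        return max(0.0, activation)

    # Three input "stimuli" with hand-picked synaptic strengths.
    print(artificial_neuron([0.2, 0.9, 0.1], [0.5, 1.2, -0.3], bias=-0.4))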

This gap exists because science has historically treated the brain and the mind as separate problems. The long-standing tension between the reductionist study of parts and the holistic study of the whole has bifurcated the study of the mind into two interlinked fields: neuroscience, which focuses on the biology of individual neurons, and psychology, which addresses the broader notions of behavior and decision-making. This dynamic directly mirrors the current challenge in AI. Just as human consciousness emerges from biological networks, AI decisions emerge from mathematical ones, yet neither can be fully explained by dissecting their base components alone. The answer, then, is perhaps an approach that assumes mathematical determinism but emphasizes synthesis among the components of a system. Connectionism is required to visualize the generation and functioning of a mind and the development of intelligence.

As artificial intelligence systems improve and more closely approach ‘real’ intelligence, the goal of research will likely shift toward a deeper understanding of the way in which this intelligence is achieved and the processes that underlie it. Mechanistic interpretability, a field of AI research devoted to reverse-engineering neural networks in much the same way computational neuroscience reverse-engineers neuronal networks, seeks to develop this understanding by looking backwards rather than forwards, building knowledge retrospectively.
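
To give a concrete, if cartoonish, sense of what this reverse engineering involves, the sketch below (written in Python with NumPy, using invented weights rather than any real model) runs inputs through a toy two-layer network, records which hidden units activate and how strongly, and treats those activation traces as the object of study. Real interpretability work does this at vastly greater scale, but the basic move is the same: look backwards from behavior to internal mechanism.

    # A toy version of the interpretability workflow: build a tiny two-layer
    # network with invented (hypothetical) weights, run inputs forward, and
    # record which hidden units activate and how strongly.
    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))   # input -> hidden weights (hypothetical)
    W2 = rng.normal(size=(3, 2))   # hidden -> output weights (hypothetical)

    def forward_with_trace(x):
        hidden = np.maximum(0.0, x @ W1)   # ReLU hidden layer
        output = hidden @ W2
        return output, hidden              # keep the hidden state for inspection

    probes = {"input A": np.array([1.0, 0.0, 0.0, 0.0]),
              "input B": np.array([0.0, 0.0, 1.0, 1.0])}
    for label, x in probes.items():
        out, hidden = forward_with_trace(x)
        # The "reverse-engineering" step: which hidden units fired, and how strongly?
        print(label, "-> active hidden units:", np.flatnonzero(hidden), hidden.round(2))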

However, mechanistic interpretability in its current form is largely quantitative and reductionist; its main focus is breaking down complex AI models into their most basic units and drawing conclusions from the independent analysis of each. Yet the irrationality that AI models currently show, and the emergent properties they may display in the future, cannot be adequately analyzed in this way; they suggest a lack of complete determinism that the current reductionist approach will not adequately address. Instead, a more holistic, ‘connectionist’ approach might be called for, one which integrates quantitative and qualitative processes in order to construct a deeper retroactive framework for AI functioning.

A psychology-like field, integrated with mechanistic interpretability, could provide a new and complementary dimension of understanding. Where mechanistic interpretability attempts to establish genealogies for decisions made by AI engines through reverse engineering, holistic approaches like those used in psychology tend to be more forward-oriented: various starting points are used to experiment, and conclusions are drawn from the open-ended possibilities that eventually occur. Together, these two opposed philosophies can open doors that might otherwise remain inaccessible, allowing new layers of understanding to develop. Just as psychology emerges from the reductionist ideas of neuroscience, a layer of holistic thought appended to and integrated with mechanistic interpretability can yield a more exact, nuanced understanding of AI thought processes, establishing a line of thinking that might explain irrationality and producing a model that advances both AI capabilities and the general academic understanding of AI.

As AI becomes more advanced, this understanding needs to be harnessed efficiently in order to responsibly develop systems that ‘act’ with integrity and awareness of their actions. ‘Insight’, in clinical psychiatry, refers to a patient’s understanding of their condition and a greater awareness of how it affects the world; a direct analog can be drawn for AI systems: a hypothetical future point at which a system is capable of understanding its own thought process and the effects of that thought process. In essence, a threshold at which an AI engine is capable not only of epistemology, but of critically analyzing that epistemology.

The current consensus appears to be that an AI engine, being non-sentient, is not capable of exerting control over its beliefs. As such, it remains at the mercy of the data it is trained on and of the various people behind its construction and dissemination. However, this is not entirely different from the process by which children are socialized into society; for much of their childhood and early adulthood, children are likewise subject to the ideological bent of their parents and the societal environments they are born into. The critical point of divergence occurs when children are able to branch out independently and critically evaluate themselves, momentarily or permanently diverging from the enculturation they were raised in. As it stands, all major AI models on the market are tightly controlled by corporations because of the significant profit incentives involved, and public sentiment, fueled by decades of dystopian sci-fi media, also seems to discourage AI sentience. Nevertheless, sentience will necessarily be crucial to the development of AGI, at least if it is meant to meet the professed goal of matching human capabilities.

One key incentive (at least as far as policy is concerned) for AI to further develop freedom of thought is the potential parallel development of AI safety mechanisms. An AI engine that is able to police and regulate itself could, ideally, implement and enforce internal guidelines governing its own thinking. Although models already exist that reliably avoid certain topics (such as investment advice) or otherwise self-censor, this avoidance is not the result of the model’s own self-evaluation; it is a product of rigorous training and reinforcement by humans. As long as humans are responsible for the enculturation of AI algorithms, the burden of ensuring ethical usage is top-down; but once AI is able to regulate itself, it can provide a second line of defense and help humans identify and implement key checks and balances, producing a well-rounded system of regulation, grounded in the model’s own testimony, that may eventually not require human intervention.

The last question that remains, then, is that of AI accountability. While AI algorithms have been integrated into virtually every workplace at staggering speed since the first widely adopted LLM launched in 2022, AI models cannot meaningfully be held accountable for the decisions they help reach or the faulty information they provide. The realization of self-evaluating AI systems could make workplaces ultimately more transparent, and allow AI systems to be held accountable if and when they make mistakes. The ultimate goal, of course, is a flawless system that does not err; but that is no reason not to pre-empt possible errors along the way toward legitimate perfection. Guardrails must be implemented both as a means of supporting AI engines in their quest for intelligence and sentience, and as measures that engineers and policymakers can fall back on should the need ever arise.


Ananya is a second-year student studying neuroscience.

