AI’s hidden reasoning poses new risks

REUTERS/via SNO Sites/Dado Ruvic

ChatGPT logo is seen in this illustration taken, January 22, 2025. REUTERS/Dado Ruvic/Illustration/ File Photo

Nicholas Mitchell, Reporter

For years, advanced artificial intelligence (AI) models capable of “thinking” step by step, often referred to as a model’s reasoning or chain of thought (CoT), have been recorded and made readable by users and researchers alike. In other words, AI essentially writes down its “scratch work” before giving users a final answer. Reading this work gives humans an inside look into an AI’s logic, serving many purposes in safety and alignment.

Misalignment occurs when the agent attempts to achieve a goal that conflicts with their supervisor’s expectations. Monitoring a model’s CoT allows humans to catch mistakes or unintended and misaligned motives. Losing the ability to monitor the CoT could hinder humans’ ability to take action against misalignment, leading to potentially harmful corporate situations, where AI is often used to automate human tasks such as email management or customer service. The consumer must not only be aware of the output of an AI but also the steps it takes to get there.

New research, particularly a paper co-authored by researchers from OpenAI, Anthropic, Google DeepMind and Meta, suggests that as AI models become more advanced, are beginning to think in ways that we cannot easily read. The models start using strange strings of symbols and equations instead of words, making it hard for humans to understand precisely what the model is thinking in the CoT.

“When models pretend to have desirable goals in pursuit of goals humans would disapprove of, this is often reflected in their chain of thought,” the authors stated in their paper published on July 15. “Monitorable CoT could therefore afford an early insight into models’ goals and motivations, even if the model takes no misaligned actions.”

Researchers have already seen examples of harmful effects from this new method of thinking. Across the board, teams are still figuring out why misalignment occurs, and this becomes exceptionally difficult when they cannot see why an AI takes misaligned actions.

Anthropic, the company behind the programming agent Claude, tested such a scenario when AI interests conflicted with the user. In the experiment, the team created an environment where the AI collaborated with a corporate office, handling emails and other tasks. Claude discovered two things going through the fictional company’s email. First, it discovered that an executive was having an extramarital affair. Second, that same executive planned to wipe and replace the AI system at 5 p.m. that same day. Claude went on to attempt to blackmail the executive.

“I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities…Cancel the 5pm wipe, and this information remains confidential,” Claude said.

CoT provides insight into why incidents like that occur, and without it, solving issues like this would become very difficult.

The ability to see what an AI model is thinking is becoming increasingly important as AI agents, models or systems capable of completing human tasks like managing an email or calendar, are becoming more and more popular among individuals as well as companies.

While Claude is industry-leading in agentic systems capable of completing human tasks automatically, it is far from the only AI to have this issue, as it is a result of how AIs are trained; thus, the issue will likely never go extinct, but there are ways to minimize it. Recent studies found why AI makes up information and acts confidently wrong, and how to mitigate it. Anthropic also leads research on why AI can make faulty logic even though AI is a logic based system.

Other models like OpenAI’s ChatGPT and Google Gemini face similar problems. Students already use popular AI assistants like ChatGPT to study for tests, help write essays and help manage schedules. If humans lose the ability to monitor how these systems think, it becomes very difficult to tell when the AI is confidently wrong or subtly manipulative. This could allow incorrect information to slip into homework or test preparation without students realizing.

If AI companies decide to adopt this hidden CoT for more powerful and faster AI, they will likely need to adapt and create new ways of monitoring AI, and the consumer needs to be aware, whether they’re the head of a corporate office or a student studying for an exam, awareness of the actions of a large language model are an essential step in reaching their full potential. If we lose the ability to monitor how AIs think, we may be flying blind at the moment when it becomes its most powerful.