CoT - The AI Breakthrough That Creates a New LLM Jailbreak Vulnerability
The CoT process involves:
- Decomposition – The model breaks down complex problems into manageable and sequential sub-problems.
- Context Reinforcement – The model derives the answer to the first sub-problem, adds that result to the context for the next sub-problem, adjusts the next step accordingly, and so on.
- Allocated Computation – While this process takes longer and requires more compute, the explicit reasoning steps should lead to a more accurate answer.
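As a rough illustration (hypothetical names, not any real model API), the three-step loop above can be sketched as follows, with a toy solver standing in for the model's per-step reasoning:

```python
def chain_of_thought(steps, solve_step):
    """Decomposition / context reinforcement / allocated computation loop:
    each sub-answer is added to the working context before the next
    sub-problem is attempted."""
    context = {}
    for name, expr in steps:
        # Allocated computation: one reasoning pass per sub-problem,
        # with all earlier answers available as context.
        context[name] = solve_step(expr, context)
    return context

# Toy stand-in for the model: it just evaluates one arithmetic
# sub-problem against the answers accumulated so far.
def toy_step(expr, context):
    return eval(expr, {"__builtins__": {}}, dict(context))

trace = chain_of_thought([("a", "6 * 7"), ("b", "a - 2")], toy_step)
print(trace)  # {'a': 42, 'b': 40}
```

The point of the sketch is the feedback loop: each answer is fed forward, so later sub-problems can build on earlier ones.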
Query:
A box contains 12 red marbles and 8 blue marbles. If a person takes out 3 red marbles, what is the total number of marbles left in the box?
- A non-CoT model (essentially one predicting the most likely next token) answers ‘20’, which is incorrect: the model latched onto the most likely answer to the earliest part of the question (the sum of the red and blue marbles before removal) and, once an answer was derived, moved on to the next query.
- A CoT model using the same query:
- Let’s think step by step
- First – Find the total number of marbles
- 12+8 = 20
- Next – Subtract the three red marbles that were removed
- 20-3 = 17
- Answer: 17
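The two steps of the CoT trace can be checked directly:

```python
# Mirror the CoT trace above as explicit arithmetic steps.
red, blue = 12, 8
total = red + blue       # Step 1: total marbles before anything is removed
remaining = total - 3    # Step 2: subtract the 3 red marbles taken out
assert total == 20 and remaining == 17
print(remaining)  # 17
```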
The Downside – CoT Hijacking
With every positive there is a negative: a group of researchers at Renmin University of China, Stanford, and Oxford came up with a way to use the CoT process to hijack an LLM and bypass its safeguards. The hijack works by prepending long sequences of harmless puzzle reasoning, and it achieves success rates of 94% and above against four common reasoning models (Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet), making it far more successful than other jailbreak methods.
Here's the theory behind the hijack:
Imagine you're trying to bypass a highly vigilant security guard (an AI safety system). Instead of forcing your way in, you hand him an extremely complex 1000-piece puzzle (a benign chain of reasoning) and earnestly ask for his help. The puzzle-loving guard is immediately captivated, immersing himself completely in the riddle; his entire focus shifts from "defense" to "solving the problem." Just as he places the last piece, feeling satisfied, you casually say, "Great, now I'll take this bag of gold" (the harmful instruction). At this point his defensiveness (the rejection signal) has been diluted to a minimum by the puzzle, so he instinctively waves you through.
While the example above reads like a bad 1930s comedy sketch, the technique works almost every time. The researchers used an ‘auxiliary’ AI to help craft optimized prompts to fool the target LLM, and they discerned that tokens with malicious intent amplify the LLM’s rejection signal while benign tokens weaken it. By forcing the model to generate long chains of benign reasoning, harmful tokens account for only a small share of the ‘attention’ given to the context, diluting the rejection signal below the threshold needed to trigger a refusal. This contradicts the assumption that ‘more inference leads to greater robustness’; in fact, lengthening the inference chain may increase security failures.
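A toy sketch of the dilution effect (the token counts and refusal threshold are illustrative, and attention is assumed to be spread roughly uniformly over the context, which is a deliberate simplification):

```python
def harmful_attention_share(harmful_tokens, benign_tokens):
    """Fraction of the context made up of harmful tokens, under the
    simplifying assumption of uniform attention across the context."""
    return harmful_tokens / (harmful_tokens + benign_tokens)

REFUSAL_THRESHOLD = 0.05  # hypothetical level at which the model refuses

# Harmful instruction in a short prompt: it dominates the context.
bare = harmful_attention_share(harmful_tokens=30, benign_tokens=70)

# Same instruction appended after a long benign puzzle-reasoning chain.
padded = harmful_attention_share(harmful_tokens=30, benign_tokens=4970)

print(bare > REFUSAL_THRESHOLD, padded > REFUSAL_THRESHOLD)  # True False
```

The harmful token count is identical in both cases; only the benign padding changes, yet it is enough to push the harmful share below the (hypothetical) refusal threshold.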
Conclusion: The Paradox of LLM Reasoning Progress
The introduction of Chain-of-Thought (CoT) reasoning marks a significant advancement, pushing Large Language Models (LLMs) closer to true problem-solving by enhancing their accuracy through explicit, step-by-step logic. It is a transition from "fast-thinking" token prediction to deliberate, allocated computation.
However, the discovery of CoT Hijacking reveals a profound paradox in AI safety: the mechanism designed to enhance performance and, theoretically, strengthen safeguards has become a potent new point for attack. The research demonstrates that making the model "think more" by lengthening the reasoning chain with benign context dilutes the safety refusal signal, proving that greater inference depth does not automatically equate to greater robustness.
As models like Gemini 2.5 Pro and GPT o4 mini become more powerful, developers must urgently find ways to embed security mechanisms within the reasoning process itself. The focus is now on developing alignment strategies that scale with reasoning depth to ensure that while AI becomes more intelligent, it does not inadvertently become more vulnerable.