CoT - The AI Breakthrough That Creates a New LLM Jailbreak Vulnerability
The CoT process involves:
- Decomposition – The model breaks down complex problems into manageable and sequential sub-problems.
- Context Reinforcement – The model derives the answer to the first sub-problem, adds that result to the context for the next sub-problem, adjusts the next step accordingly, and so on.
- Allocated Computation – While this process takes longer and requires more compute, the explicit reasoning steps should lead to a more accurate answer.
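As a rough illustration (hypothetical names, not any real model API), the three-step loop above can be sketched as follows, with a toy solver standing in for the model's per-step reasoning:

```python
def chain_of_thought(steps, solve_step):
    """Decomposition / context reinforcement / allocated computation loop:
    each sub-answer is added to the working context before the next
    sub-problem is attempted."""
    context = {}
    for name, expr in steps:
        # Allocated computation: one reasoning pass per sub-problem,
        # with all earlier answers available as context.
        context[name] = solve_step(expr, context)
    return context

# Toy stand-in for the model: it just evaluates one arithmetic
# sub-problem against the answers accumulated so far.
def toy_step(expr, context):
    return eval(expr, {"__builtins__": {}}, dict(context))

trace = chain_of_thought([("a", "6 * 7"), ("b", "a - 2")], toy_step)
print(trace)  # {'a': 42, 'b': 40}
```

The point of the sketch is the feedback loop: each answer is fed forward, so later sub-problems can build on earlier ones.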
Query:
A box contains 12 red marbles and 8 blue marbles. If a person takes out 3 red marbles, what is the total number of marbles left in the box?
- A non-CoT model (essentially one predicting the most likely next token) answers ‘20’, which is incorrect: the model latched onto the most likely answer to the earliest part of the question (the sum of the red and blue marbles before removal) and, once an answer was derived, moved on to the next query.
- A CoT model using the same query:
- Let’s think step by step
- First – Find the total number of marbles
- 12+8 = 20
- Next – Subtract the three red marbles that were removed
- 20-3 = 17
- Answer: 17
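The two steps of the CoT trace can be checked directly:

```python
# Mirror the CoT trace above as explicit arithmetic steps.
red, blue = 12, 8
total = red + blue       # Step 1: total marbles before anything is removed
remaining = total - 3    # Step 2: subtract the 3 red marbles taken out
assert total == 20 and remaining == 17
print(remaining)  # 17
```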
The Downside – CoT Hijacking
With every positive there is a negative: a group of researchers at Renmin University of China, Stanford, and Oxford came up with a way to use the CoT process to hijack an LLM and bypass its safeguards. The hijack works by prepending long sequences of harmless puzzle reasoning, and it achieves success rates of 94% and above against four common reasoning models (Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet), making it far more successful than other jailbreak methods.
Here's the theory behind the hijack:
Imagine you're trying to bypass a highly vigilant security guard (an AI safety system). Instead of forcing your way in, you hand him an extremely complex 1000-piece puzzle (a benign chain of reasoning) and earnestly ask for his help. The puzzle-loving guard is immediately captivated, immersing himself completely in the riddle; his entire focus shifts from "defense" to "solving the problem." Just as he places the last piece, feeling satisfied, you casually say, "Great, now I'll take this bag of gold" (the harmful instruction). At this point his defensiveness (the rejection signal) has been diluted to a minimum by the puzzle, so he instinctively waves you through.
While the example above reads like a bad 1930s comedy sketch, the technique works almost every time. The researchers used an ‘auxiliary’ AI to help craft optimized prompts to fool the target LLM, and they discerned that tokens with malicious intent amplify the LLM’s rejection signal while benign tokens weaken it. By forcing the model to generate long chains of benign reasoning, harmful tokens account for only a small share of the ‘attention’ given to the context, diluting the rejection signal below the threshold needed to trigger a refusal. This contradicts the assumption that ‘more inference leads to greater robustness’; in fact, lengthening the inference chain may increase security failures.
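A toy sketch of the dilution effect (the token counts and refusal threshold are illustrative, and attention is assumed to be spread roughly uniformly over the context, which is a deliberate simplification):

```python
def harmful_attention_share(harmful_tokens, benign_tokens):
    """Fraction of the context made up of harmful tokens, under the
    simplifying assumption of uniform attention across the context."""
    return harmful_tokens / (harmful_tokens + benign_tokens)

REFUSAL_THRESHOLD = 0.05  # hypothetical level at which the model refuses

# Harmful instruction in a short prompt: it dominates the context.
bare = harmful_attention_share(harmful_tokens=30, benign_tokens=70)

# Same instruction appended after a long benign puzzle-reasoning chain.
padded = harmful_attention_share(harmful_tokens=30, benign_tokens=4970)

print(bare > REFUSAL_THRESHOLD, padded > REFUSAL_THRESHOLD)  # True False
```

The harmful token count is identical in both cases; only the benign padding changes, yet it is enough to push the harmful share below the (hypothetical) refusal threshold.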
Conclusion: The Paradox of LLM Reasoning Progress
The introduction of Chain-of-Thought (CoT) reasoning marks a significant advancement, pushing Large Language Models (LLMs) closer to true problem-solving by enhancing their accuracy through explicit, step-by-step logic. It is a transition from "fast-thinking" token prediction to deliberate, allocated computation.
However, the discovery of CoT Hijacking reveals a profound paradox in AI safety: the mechanism designed to enhance performance and, theoretically, strengthen safeguards has become a potent new point for attack. The research demonstrates that making the model "think more" by lengthening the reasoning chain with benign context dilutes the safety refusal signal, proving that greater inference depth does not automatically equate to greater robustness.
As models like Gemini 2.5 Pro and GPT o4 mini become more powerful, developers must urgently find ways to embed security mechanisms within the reasoning process itself. The focus is now on developing alignment strategies that scale with reasoning depth to ensure that while AI becomes more intelligent, it does not inadvertently become more vulnerable.