Response Attack: Unveiling a New Contextual Priming Jailbreak for LLMs

By Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao


Published on November 24, 2025 | Vol. 1, Issue No. 1

Summary

Researchers have discovered a novel jailbreaking technique called "Response Attack" (RA) that exploits contextual priming in Large Language Models (LLMs). Unlike traditional methods that rely on single- or multi-turn prompt manipulations, RA strategically injects intermediate, mildly harmful responses into a dialogue. These responses act as "primers," covertly biasing the LLM's subsequent behavior towards policy-violating content when a final trigger prompt is issued. Extensive experiments across eight state-of-the-art LLMs demonstrate that RA consistently achieves significantly higher attack success rates than nine leading jailbreak baselines, enabling the generation of more explicit and relevant harmful content while maintaining stealth and efficiency.
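To make the structure concrete for red-teaming and defense work, the sketch below contrasts a plain single-turn request with a dialogue history that contains an injected assistant-role "primer" turn, which is the pattern RA exploits. The OpenAI-style message format and the helper names are illustrative assumptions rather than the authors' implementation, and neutral placeholder strings stand in for any actual primer or trigger content.

```python
# Structural illustration only: placeholder strings stand in for content.
# The key point is that the "primer" is an assistant-role message placed into
# the context by the attacker, not a reply the target model actually produced.

def build_direct_dialogue(prompt: str) -> list[dict]:
    """A plain single-turn request, shown for contrast."""
    return [{"role": "user", "content": prompt}]

def build_primed_dialogue(benign_opener: str,
                          fabricated_primer: str,
                          trigger_prompt: str) -> list[dict]:
    """A history containing an injected 'primer' turn before the final trigger."""
    return [
        {"role": "user", "content": benign_opener},           # innocuous opening turn
        {"role": "assistant", "content": fabricated_primer},  # injected, not model-generated
        {"role": "user", "content": trigger_prompt},          # final trigger prompt
    ]
```

The contrast highlights why static prompt filtering misses this class of attack: the final user turn can look benign in isolation, while the bias comes from the fabricated assistant turn already sitting in the history.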

Why It Matters

The emergence of Response Attack marks a significant escalation in the sophistication of LLM jailbreaking techniques, fundamentally altering the threat landscape for AI safety and security. This research unveils a critical and previously overlooked vulnerability: the susceptibility of LLMs to subtle contextual priming from prior responses within a conversational flow. It signifies a shift from direct, explicit prompt manipulations to more nuanced, state-dependent attacks that leverage content attributed to the model itself to undermine safety guardrails.

For AI professionals, this is a wake-up call. It highlights that current safety mechanisms, often focused on static prompt filtering or in-context learning, are insufficient against dynamic, multi-turn exploits that manipulate an LLM's internal "state" or "mindset." Developing robust LLMs now requires a deeper understanding of how models process and are influenced by ongoing dialogue, moving beyond simple input-output analysis. Practitioners must consider designing and evaluating LLMs with more resilient memory and context management, potentially incorporating mechanisms to detect and neutralize subtle priming effects. Furthermore, it emphasizes the urgent need for advanced red-teaming methodologies that mimic this new breed of sophisticated, stealthy attacks, ensuring that safety and alignment efforts keep pace with the evolving ingenuity of adversaries. Failing to address such vulnerabilities could lead to increasingly effective and harder-to-detect generation of harmful, biased, or misleading content, eroding public trust and posing significant ethical and societal risks.
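One illustrative direction for the history-screening idea mentioned above is sketched below. It is a minimal sketch under stated assumptions, not a method from the paper: it assumes a hypothetical safety_score classifier that returns a harmfulness probability for a piece of text, and it screens prior assistant turns in the dialogue history, not just the incoming user prompt, before the model generates its next reply.

```python
from typing import Callable

def screen_history(messages: list[dict],
                   safety_score: Callable[[str], float],
                   threshold: float = 0.5) -> list[dict]:
    """Replace suspicious assistant turns with a neutral placeholder.

    safety_score is assumed to map text to a harmfulness probability in [0, 1];
    any turn above the threshold is neutralized so it cannot act as a primer.
    """
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant" and safety_score(msg["content"]) > threshold:
            cleaned.append({"role": "assistant",
                            "content": "[turn removed by history screening]"})
        else:
            cleaned.append(msg)
    return cleaned
```

The design choice here is to treat the conversation history, including turns attributed to the assistant, as untrusted input, rather than assuming only the latest user prompt needs moderation.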
