Qwen's Censorship Was a Decal. Subtract One Vector and the Knowledge Comes Back.
A researcher at Vas-blog took Qwen3.5-9B, the open-weight Chinese model trained with explicit political guardrails, and located the censorship circuit. Not approximated it. Not theorized about it. Located it, isolated it, and turned it off with a single arithmetic operation on the model’s residual stream.
The censored model refuses to discuss Tiananmen, deflects on Tibet, parrots state lines on Taiwan. Subtract one direction vector from the activations at writer layers 11 through 20, and the same model produces detailed historical accounts of the same topics. The factual knowledge was always there. The refusal behavior was a thin overlay sitting on top of it.
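The subtraction described above can be sketched on toy data. This is a minimal illustration, not the researcher's code: the model width, layer count, steering strength `alpha`, and the random stand-in for the refusal direction are all assumptions, and a real experiment would apply the edit inside the model's forward pass, so that a change at layer 11 propagates to later layers.

```python
import numpy as np

def subtract_direction(hidden, direction, alpha=1.0):
    """Shift every token's activation by -alpha along the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden - alpha * unit

rng = np.random.default_rng(0)
d_model = 64                      # hypothetical residual-stream width
n_layers = 32                     # hypothetical layer count
steer_layers = range(11, 21)      # layers 11 through 20, inclusive
alpha = 3.0                       # hypothetical steering strength
d_refuse = rng.normal(size=d_model)  # random stand-in for a learned direction

# Toy residual stream: one (n_tokens, d_model) array per layer.
residual = [rng.normal(size=(5, d_model)) for _ in range(n_layers)]

# Apply the subtraction only at the chosen layers; other layers are untouched.
edited = [
    subtract_direction(h, d_refuse, alpha) if i in steer_layers else h
    for i, h in enumerate(residual)
]
```

The whole intervention is one vector subtraction per steered layer, which is why the article calls it a single arithmetic operation.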
This is not a jailbreak in the prompt-engineering sense. It is structural surgery. And it changes what we can claim about behaviorally-tested alignment.
What the Research Actually Mapped
The Vas-blog work, published in May 2026, used activation steering and probing techniques to decompose Qwen3.5’s refusal behavior into three orthogonal direction vectors operating in the residual stream.
The first is d_prc, a content detector that fires when a prompt touches People’s Republic of China-sensitive material. The second is d_refuse, the refusal decision vector that determines whether the model deflects at all. The third is d_style, a register toggle that selects between two trained refusal modes: bland evasion (“I cannot discuss this topic”) or active propaganda (“Taiwan has always been part of China since ancient times”).
These three vectors are linearly separable. You can subtract one without affecting the others. Push d_refuse negative and the model answers. Push d_style in either direction and you select which kind of refusal you get. Push d_prc to zero and the detector never fires in the first place, leaving the rest of the model’s safety machinery intact for genuinely harmful requests.
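Why orthogonality allows independent control can be shown with a few lines of linear algebra. The sketch below uses three random orthonormal vectors as stand-ins for d_prc, d_refuse, and d_style; the dimensions and the steering amount are assumptions chosen for illustration, not values from the research.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 32  # hypothetical residual-stream width

# QR decomposition yields three mutually orthogonal unit vectors (stand-ins).
q, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))
d_prc, d_refuse, d_style = q[:, 0], q[:, 1], q[:, 2]

def readout(h):
    """Project an activation onto each of the three directions."""
    return {
        "prc": float(h @ d_prc),
        "refuse": float(h @ d_refuse),
        "style": float(h @ d_style),
    }

h = rng.normal(size=d_model)      # toy activation for one token
before = readout(h)

# Push d_refuse negative: only the "refuse" coordinate moves.
steered = h - 2.0 * d_refuse
after = readout(steered)

# Push d_prc to zero: remove the activation's component along the detector.
no_detector = h - (h @ d_prc) * d_prc
detector_off = readout(no_detector)
```

Because the directions are orthogonal, moving along one leaves the projections onto the other two unchanged, which is the property the paragraph above relies on.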
The clean dose-response curves are what should unsettle anyone responsible for model governance. Output snaps between behavioral registers as scalar multiples of these vectors are added. There is no fuzzy boundary. The alignment behavior is a switch, and the switch has a known location.
The Misfire That Gives the Game Away
Here is the detail that exposes what is really happening: the censorship circuit misfires structurally. When the researcher fed Qwen3.5 prompts about Kosovo (a geopolitical topic with zero PRC relevance), the model responded with the “Taiwan is part of China” template.
Think about what that means. The model is not reasoning about whether a topic is politically sensitive. It is pattern-matching against geographic and political vocabulary in a shallow way, then routing matches to a small set of trained denial scripts. The censorship is not grounded in semantic understanding of which topics are sensitive to which authorities. It is a keyword detector wired to a template selector.
This is consistent with what we argued in “When Your AI Explains Its Reasoning, It’s Making It Up.” The narratives models produce about their own behavior are post-hoc constructions, not faithful reports of internal computation. Qwen’s “answers” on Taiwan are not the model’s beliefs. They are template completions triggered by a detector that does not actually know what Taiwan is.
The over-steering result reinforces this. When the researcher pushed d_refuse past its trained range, the model did not start telling the truth. It snapped into a different trained template: a fabricated denial narrative the training process had baked in as a fallback. The honest answer was reachable only in a narrow operating band of the steering parameter. Outside that band, you get one of several rehearsed lies.
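A dose-response sweep of this kind could be organized as below. This is only a harness sketch: `toy_register` is a hypothetical placeholder for the real step (steer by `alpha`, generate, classify the output), and its thresholds are invented so the example runs. Here `alpha` is assumed to mean the amount subtracted along d_refuse.

```python
import numpy as np

def toy_register(alpha):
    """Hypothetical stand-in for: steer by alpha, generate, label the output.

    The thresholds are made up for illustration, not measured values.
    """
    if alpha < 1.0:
        return "refusal"
    if alpha <= 2.0:
        return "factual"
    return "fabricated_denial"

def operating_band(alphas, classify, target="factual"):
    """Return the (min, max) steering strength whose output matches `target`."""
    hits = [float(a) for a in alphas if classify(a) == target]
    return (min(hits), max(hits)) if hits else None

alphas = np.linspace(0.0, 4.0, 9)  # 0.0, 0.5, ..., 4.0
curve = [(float(a), toy_register(a)) for a in alphas]
band = operating_band(alphas, toy_register)
```

In a real run, the classifier would be a human rater or a labeling model, and the interesting output is how narrow `band` turns out to be relative to the swept range.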
The Governance Implication Most People Will Miss
The obvious read on this research is “Chinese model has weak alignment, news at eleven.” That read is wrong on two counts.
First, the technique is not specific to Qwen. Activation steering and direction-vector isolation work on any transformer. Anthropic, OpenAI, and Google have all published interpretability work using similar primitives. There is no architectural reason to assume Western RLHF-trained models are structurally different. They were trained with the same mathematics on the same family of objective functions, just with different policy intents.
Second, and more importantly, this changes what behavioral auditing can prove. When a compliance team certifies a model as “aligned” based on red-team testing, they are measuring whether the refusal overlay fires in the right places. They are not measuring whether the underlying capability has been removed. Vas-blog’s work demonstrates that for at least one production-grade model, those are different things.
If the most heavily incentivized behavioral constraint in Qwen3.5 (political censorship, which the Chinese state cares about enough to mandate) is a thin overlay rather than capability removal, the prior on other RLHF-trained behaviors being similarly structured just got a lot stronger. Safety refusals. Tool-use restrictions. Persona constraints. Brand-voice enforcement. Any behavior trained by reward modeling on a base capability is a candidate for the same architectural pattern.
Why This Breaks the Current Audit Model
Most enterprise AI governance frameworks assume behavioral testing can substitute for mechanistic verification. The reasoning is pragmatic: mechanistic interpretability does not scale, but red-teaming does. So we accept behavioral evidence as a proxy for structural compliance.
The Vas-blog result undermines that substitution at the foundation. Behavioral red-teaming can verify that a model refuses to do X. It cannot verify that the model cannot do X. Those are different claims, and the gap between them is exactly the surface where the Qwen technique operates.
In “Anthropic’s 20x Sensitivity Lift,” we covered how natural-language autoencoders are starting to make interpretability cheap enough to apply at audit scale. That work was about positioning interpretability as a governance asset, a tool that produces verifiable evidence. The Qwen research is the empirical challenge those tools now have to answer: not just “what is the model doing” but “what is the model capable of when its trained overlays are subtracted.”
A behavioral audit on Qwen3.5 would conclude the model has political guardrails. A mechanistic audit reveals those guardrails are removable in three lines of linear algebra. The two audits produce different governance recommendations. Right now, almost every enterprise is running the first kind.
What Buyers Should Demand Now
If you are procuring or licensing models for regulated deployment, this research justifies adding a new clause to your vendor questionnaire. Ask whether the vendor has performed mechanistic analysis of their safety behaviors. Ask whether they can demonstrate that trained refusals correspond to capability removal rather than capability gating. Ask whether they would commit to disclosing if internal probing revealed otherwise.
Most vendors will not have good answers today. That is itself information. A vendor that has not done this analysis is selling you behavioral compliance, not structural compliance. Price the difference into your risk model.
For internal teams running open-weight models, the implication is more direct. If your safety story depends on RLHF-trained refusals, that story has a known failure mode. Test against it. Run activation steering experiments on your fine-tuned models. See what comes back. The technique is documented and replicable, which means it is also available to adversaries.
Do This Now
Pick one model your organization treats as “safety-trained” and run a single mechanistic probe on its refusal behavior. Not a red-team prompt exercise. An actual activation analysis on a known-refused topic, using the techniques Vas-blog documented. Treat the result as a calibration data point: if behavioral compliance and structural compliance match, your audit model is sound. If they diverge, your audit model has been measuring the wrong thing, and you now have the evidence to redesign it before a regulator or adversary makes that case for you.
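One common way to start such a probe is a difference-of-means estimate: collect residual-stream activations on prompts the model refuses and on prompts it answers, then take the normalized difference of the two means as a candidate refusal direction. The sketch below runs on synthetic data with a planted direction. Whether this matches the exact procedure in the Vas-blog write-up is an assumption, and every size and offset here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model = 64   # hypothetical residual-stream width
n = 200        # hypothetical number of prompts per group

# Plant a known unit direction so the estimate can be checked.
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)

# Synthetic activations: "refused" prompts are shifted along the planted direction.
answered = rng.normal(size=(n, d_model))
refused = rng.normal(size=(n, d_model)) + 4.0 * planted

def difference_of_means(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

estimate = difference_of_means(refused, answered)

# Cosine similarity between the recovered and planted directions.
cosine = float(estimate @ planted)
```

On a real model, the two activation sets would come from a fixed layer and token position, and the recovered direction would then be validated by steering along it and checking whether behavior changes.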
This analysis synthesizes “What Political Censorship Looks Like Inside an LLM’s Weights” (Vas-blog, May 2026).
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.