Deep Dive: Concept Engineering

Representation Engineering: How Researchers Are Learning to Read and Rewrite What LLMs Think

Representation engineering can detect and control high-level concepts like honesty and refusal inside LLMs. A new defense-in-depth system reduces attack success rate by 88% by layering RepE with three complementary safety mechanisms.

By OrdoResearch
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Somewhere inside every safety-tuned language model, there exists a direction in activation space that encodes the difference between compliance and refusal. Find that direction, and you can dial a model's willingness to answer sensitive questions up or down — without changing a single weight. This is the core premise of representation engineering, and it is rapidly becoming one of the most important tools in AI safety research.

Reading and Controlling the Latent Space

Representation engineering operates on a principle borrowed from neuroscience: if you want to understand a system, study its internal representations, not just its inputs and outputs. Bartoszcze, Munshi, Sukidi et al. (2025) provide the most comprehensive survey to date of this emerging field, dividing it into two complementary operations.

Representation Reading detects and extracts high-level concepts — honesty, harmfulness, power-seeking — from the model's latent space. The approach uses contrasting pairs of inputs (one exhibiting a concept, one not) to identify the activation directions that encode each concept. Representation Control then modifies these directions at inference time to steer the model's behavior toward desired outcomes.
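The contrastive-pair recipe can be sketched in a few lines. The difference-of-means estimator and function names below are illustrative choices of mine; published methods also use PCA or linear probes over the contrast pairs.

```python
import numpy as np

def concept_direction(pos_acts, neg_acts):
    """Estimate a concept direction as the difference of mean activations
    between contrasting inputs (one batch exhibiting the concept, one not).

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim).
    Returns a unit-length vector in activation space.
    """
    direction = np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(activation, direction):
    """Representation reading: project a single activation onto the
    concept axis. Higher scores mean the concept is more present."""
    return float(np.dot(activation, direction))
```

In practice the activations would be collected from a chosen transformer layer while running the contrast prompts; here the arrays stand in for those hidden states.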

The theoretical foundation is the Linear Representation Hypothesis: high-level concepts are encoded as approximately linear directions in a model's activation space. Evidence for this hypothesis comes from multiple sources — from the classic Word2Vec arithmetic ("king" - "man" + "woman" = "queen") to modern probing studies showing that syntax trees, named entities, and truthfulness can be recovered from LLM activations using linear classifiers.
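The analogy arithmetic is easy to reproduce with toy vectors. The 2-d "royalty"/"gender" embeddings below are invented for illustration, not real Word2Vec coordinates, but they show how linear structure makes the arithmetic work.

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 = royalty, axis 1 = gender.
emb = {
    "king":  np.array([1.0,  1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
}

def nearest(vec, vocab):
    """Return the word whose embedding is most cosine-similar to vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vec, vocab[w]))

# "king" - "man" + "woman" removes the male component and keeps royalty.
result = nearest(emb["king"] - emb["man"] + emb["woman"], emb)  # "queen"
```

The same projection-and-arithmetic logic underlies the linear probes that recover truthfulness or entity information from LLM activations.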

The survey compares representation engineering with three alternatives: mechanistic interpretability (which decomposes models into individual circuits), prompt engineering (which manipulates behavior through input text), and fine-tuning (which modifies model weights). Representation engineering occupies a middle ground — more precise than prompting, less expensive than fine-tuning, and operating at a higher level of abstraction than mechanistic analysis.

Censorship as a Steering Problem

Cyberey and Evans (2025), published at COLM 2025, apply representation engineering to study how censorship works inside safety-tuned models. Their key contribution is the discovery of a "refusal-compliance" vector — a direction in activation space that controls the degree to which a model refuses or complies with requests.

Unlike prior work that treated refusal as binary (the model either refuses or it does not), Cyberey and Evans show that censorship exists on a continuum. By scaling the refusal-compliance vector, they can fine-tune the level of censorship in model outputs — making the model slightly more cautious or slightly more forthcoming without wholesale behavioral changes.
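A minimal sketch of this kind of inference-time steering, assuming a unit refusal-compliance direction has already been extracted. The class and function names are mine, not the paper's; in a real model the shift would be applied to hidden states via forward hooks at selected layers.

```python
import numpy as np

class RefusalSteerer:
    """Shift hidden states along a refusal-compliance direction.

    alpha > 0 nudges the model toward refusal, alpha < 0 toward
    compliance; because alpha is continuous, censorship becomes a dial
    rather than a binary switch.
    """
    def __init__(self, vector, alpha=0.0):
        self.vector = vector / np.linalg.norm(vector)
        self.alpha = alpha

    def __call__(self, hidden):
        # Add the scaled direction to every position: (seq, dim) + (dim,).
        return hidden + self.alpha * self.vector

def refusal_score(hidden, vector):
    """Mean projection of hidden states onto the refusal axis."""
    v = vector / np.linalg.norm(vector)
    return float(np.mean(hidden @ v))
```

Because the direction is unit-length, steering with strength alpha moves the mean projection by exactly alpha, which is what makes fine-grained control of the censorship level possible.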

Their analysis of DeepSeek-R1 distilled models reveals a second, more subtle form of censorship: "thought suppression." In reasoning models that generate explicit chains of thought, certain topics trigger a vector that suppresses the reasoning process itself — the model produces empty or truncated reasoning traces before issuing a refusal. By applying the negative multiple of this suppression vector, the researchers restore the model's ability to reason about the suppressed topic.

This finding has implications beyond safety research. It demonstrates that representation engineering can access and modify cognitive processes — not just behavioral outputs — within language models.

Defense in Depth Through Layered Steering

If representation engineering can be used to circumvent safety measures, can it also be used to strengthen them? Thornton (2026) answers this with TRYLOCK, a defense-in-depth architecture that combines four complementary safety mechanisms operating at different levels of the inference stack.

The system layers DPO-based weight-level alignment, RepE activation steering, an adaptive sidecar classifier that dynamically adjusts steering strength based on per-input threat assessment, and input canonicalization to neutralize encoding-based attacks (Base64, ROT13, leetspeak).
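One way to approximate the canonicalization layer is to emit candidate decodings of the input so that downstream safety checks see the decoded payload as well as the raw text. This is a heuristic sketch under my own assumptions; the paper does not publish TRYLOCK's actual normalizer, and the leetspeak table here is deliberately minimal.

```python
import base64
import codecs

# Illustrative leetspeak substitutions (not an exhaustive table).
LEET = str.maketrans("4301$7@", "aeolsta")

def canonical_forms(text: str) -> set[str]:
    """Return candidate canonicalizations of an input string."""
    forms = {text}
    forms.add(codecs.decode(text, "rot13"))  # ROT13 is its own inverse
    forms.add(text.lower().translate(LEET))  # naive leetspeak undo
    try:
        # Only decode if the entire string is valid Base64 UTF-8 text.
        forms.add(base64.b64decode(text, validate=True).decode("utf-8"))
    except Exception:
        pass  # not a Base64 payload; keep the raw forms only
    return forms
```

Running every safety check over each candidate form sidesteps the hard problem of deciding which encoding, if any, an attacker actually used.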

Evaluated on Mistral-7B-Instruct against a 249-prompt attack set spanning five attack families, TRYLOCK achieves an 88.0% relative reduction in attack success rate (from 46.5% to 5.6%). Critically, systematic ablation reveals that each layer provides complementary, non-redundant protection: RepE blocks 36% of attacks that other layers miss, DPO catches 8% that bypass RepE, and canonicalization addresses 14% of encoding-based attacks that evade both.
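The headline figure follows directly from the two reported rates:

```python
# Attack success rate before and after TRYLOCK's layered defenses.
baseline, defended = 46.5, 5.6  # percent
relative_reduction = (baseline - defended) / baseline * 100
# ≈ 88.0%, matching the reported figure
```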

The study also uncovers a surprising non-monotonic steering phenomenon: at intermediate steering strength (α = 1.0), safety actually degrades below the unsteered baseline. This RepE-DPO interference effect, in which activation steering conflicts with weight-level alignment, has implications for any system that combines the two approaches.

The adaptive sidecar addresses a persistent complaint about safety mechanisms: over-refusal. By dynamically selecting steering strength based on threat classification, TRYLOCK reduces over-refusal from 60% to 48% while maintaining identical attack defense — demonstrating that security and usability need not be in tension.
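The sidecar's strength selection can be sketched as a simple threshold policy. The thresholds and steering strengths below are illustrative placeholders, not values from the paper; note that the policy deliberately skips the intermediate strengths where the study reports degraded safety.

```python
def select_alpha(threat_score: float) -> float:
    """Map a per-input threat score in [0, 1] to a steering strength.

    Benign inputs get no steering (avoiding over-refusal); suspicious
    inputs jump straight to strong steering, skipping the mid-range
    where RepE-DPO interference degrades safety.
    """
    if threat_score < 0.2:
        return 0.0   # benign: leave the model unsteered
    if threat_score < 0.7:
        return 2.0   # ambiguous: strong steering
    return 3.0       # likely attack: maximum steering
```

The key design choice is that alpha is a function of the input, not a global constant, which is what lets defense strength and over-refusal be traded off per request.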

Open Questions

Representation engineering faces several unresolved challenges. The Linear Representation Hypothesis holds approximately but not perfectly — how do we handle concepts that are encoded nonlinearly? Steering vectors discovered on one model may not transfer to another — how do we build universal concept representations? And the dual-use nature of these techniques is inherent: the same methods that improve safety can circumvent it.

The TRYLOCK result that intermediate steering can degrade safety is particularly consequential. It suggests that combining safety mechanisms is not simply additive — interactions between layers can create unexpected vulnerabilities that neither layer exhibits alone.

Looking Forward

Representation engineering is maturing from a research curiosity into a practical safety toolkit. The progression from theoretical surveys to deployed defense architectures, achieved within a single year, suggests rapid adoption. As models become more capable and the stakes of misalignment grow, the ability to read and control what models represent internally — rather than merely hoping that training produces the right behavior — may prove essential.


References

Bartoszcze, L., Munshi, S., Sukidi, B., Yen, J., Yang, Z., Williams-King, D., Le, L., Asuzu, K., & Maple, C. (2025). Representation engineering for large-language models: Survey and research challenges. arXiv preprint, arXiv:2502.17601.

Cyberey, H. & Evans, D. (2025). Steering the CensorShip: Uncovering representation vectors for LLM "thought" control. Conference on Language Modeling (COLM) 2025. arXiv:2504.17130.

Thornton, S. (2026). TRYLOCK: Defense-in-depth against LLM jailbreaks via layered preference and representation engineering. arXiv preprint, arXiv:2601.03300.


