Deep Dive: Concept Engineering

Vortices, Proteins, and Melodies: How Concept Engineering Is Escaping the Text Box

The interpretability tools built for language models — sparse autoencoders, activation steering, concept probing — are proving equally effective on physics simulators, protein models, and music generators. The convergence suggests something universal about how foundation models organize knowledge.

By OrdoResearch
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The tools developed to peer inside large language models — probing classifiers, sparse autoencoders, activation steering — were built for text. They find concepts like "truthfulness" or "sentiment" encoded as directions in a language model's activation space, and then manipulate those directions to control the model's behavior. A striking development in 2025 is the discovery that these same techniques work on foundation models that have never processed a single word of natural language. Physics simulators, protein sequence models, and music generators all appear to organize their internal representations around steerable concept directions — and this convergence is reshaping our understanding of what foundation models actually learn.

Steering the Physics of a Simulation

Fear, Mukhopadhyay, McCabe et al. (2025), working with the PolymathicAI collaboration at Cambridge and the Simons Foundation, apply activation steering to Walrus, a large physics foundation model trained on numerical simulations across multiple physical domains. The question they pose is precise: do scientific foundation models learn internal representations that align with physical concepts the way language models learn representations that align with semantic ones?

The answer is yes, and the evidence is direct. By computing the difference between model activations when processing contrasting physical regimes — a turbulent flow versus a laminar one, a rapidly diffusing system versus a slowly diffusing one — the researchers identify single directions in activation space that correspond to recognizable physical phenomena. One direction encodes vorticity. Another encodes diffusion rate. A third captures temporal progression — the sense of how far a simulation has advanced in time.
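The contrastive extraction described above can be sketched in a few lines. This is a minimal numpy illustration with synthetic activations, not the paper's code: a known direction is planted in fake "turbulent" versus "laminar" hidden states, and the difference of the two regime means recovers it. All names and dimensions here (`acts_turbulent`, `d_model`) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden-state width

# Plant a ground-truth concept direction in synthetic activations:
# each row stands in for the model's hidden state on one snapshot.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

acts_turbulent = rng.normal(size=(200, d_model)) + 3.0 * true_direction
acts_laminar = rng.normal(size=(200, d_model)) - 3.0 * true_direction

# Difference-in-means: the concept direction is the (normalized) gap
# between the mean activation under each physical regime.
direction = acts_turbulent.mean(axis=0) - acts_laminar.mean(axis=0)
direction /= np.linalg.norm(direction)

# Cosine similarity between recovered and planted direction.
alignment = float(direction @ true_direction)
```

With enough samples the per-coordinate noise in the two means averages out, so the recovered direction aligns almost perfectly with the planted one.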

The critical finding is that these concept directions are not merely diagnostic but causal. Injecting a vorticity direction into the model during inference induces vortices in the output simulation. Suppressing it removes them. Diffusion can be amplified or dampened. Simulations can be accelerated or decelerated. The model is not simply recognizing these phenomena — it has organized its internal representations so that each phenomenon corresponds to a manipulable control dimension.
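The intervention itself is additive: shift the hidden state along the concept direction to amplify the concept, or subtract its projection to ablate it. A hedged sketch, again on synthetic vectors; the name `vorticity_dir` and the coefficient values are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Stand-ins for a mid-inference hidden state and a concept direction
# previously extracted by contrasting regimes.
hidden = rng.normal(size=d_model)
vorticity_dir = rng.normal(size=d_model)
vorticity_dir /= np.linalg.norm(vorticity_dir)

def steer(h, direction, alpha):
    """Additive steering: alpha > 0 strengthens the concept in the
    hidden state, alpha < 0 weakens it."""
    return h + alpha * direction

def proj(h):
    """Read out the concept's strength as a projection coefficient."""
    return float(h @ vorticity_dir)

amplified = steer(hidden, vorticity_dir, alpha=8.0)
# Subtracting the full projection removes the concept entirely.
suppressed = steer(hidden, vorticity_dir, alpha=-proj(hidden))
```

Amplification moves the projection up by exactly `alpha`; suppression drives it to zero, which is the "removes the vortices" case in the text.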

More remarkable still, these concept directions transfer between unrelated physical systems. A vorticity direction extracted from fluid dynamics simulations successfully induces rotational behavior when applied to completely different physical systems. This cross-domain transferability suggests that the model has learned general representations of physical principles rather than system-specific shortcuts — providing evidence for the Linear Representation Hypothesis in the domain of scientific computing, far from its origins in natural language processing.

Decoding Protein Biology Through Sparse Autoencoders

Adams, Bai, Lee et al. (2025), published at ICML 2025, take the interpretability toolkit in a different direction entirely: into molecular biology. They train sparse autoencoders on the residual stream of ESM-2, a protein language model trained on millions of protein sequences. The goal is to determine whether the features discovered by SAEs — originally designed to decompose LLM representations into interpretable components — reveal meaningful protein biology.

The features they discover organize into two categories. Generic features activate across many protein families, encoding broad properties like sequence composition or structural tendency. Family-specific features activate selectively for particular protein families, capturing the distinctive sequence signatures that distinguish one functional group from another. This dual representation — general plus specific — mirrors what has been observed in language models, where SAEs find both broad semantic features and narrow topical ones.
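The SAE machinery itself is compact: an overcomplete ReLU encoder that maps residual-stream activations to many sparse features, a linear decoder that reconstructs them, and a loss trading reconstruction error against an L1 sparsity penalty. A forward-pass sketch under assumed toy dimensions; the weights are random (untrained) and the data is synthetic, so this shows the architecture and loss, not learned protein features.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae, n = 32, 128, 512  # hypothetical sizes (d_sae > d_model)

# Synthetic stand-in for residual-stream activations.
X = rng.normal(size=(n, d_model))

# Sparse autoencoder parameters: ReLU encoder, linear decoder.
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1

def sae_forward(X):
    f = np.maximum(X @ W_enc + b_enc, 0.0)  # sparse feature activations
    X_hat = f @ W_dec                       # linear reconstruction
    return f, X_hat

f, X_hat = sae_forward(X)
recon_loss = np.mean((X - X_hat) ** 2)  # reconstruction term
l1_penalty = np.abs(f).mean()           # sparsity term
loss = recon_loss + 0.01 * l1_penalty   # training minimizes this sum

# Even before training, the ReLU zeroes out roughly half the features;
# the L1 term pushes sparsity much further during training.
sparsity = float((f == 0).mean())
```

The interpretable objects are the rows of `W_dec`: each active feature contributes one decoder direction to the reconstruction, which is what gets labeled as "generic" or "family-specific" in the analysis.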

The practical value emerges through linear probing. The researchers demonstrate that SAE features can identify known sequence determinants of thermostability — the property that determines whether a protein remains functional at high temperatures — and subcellular localization, which determines where in the cell a protein operates. These are properties of immense practical importance in protein engineering and drug design. For features without known functional associations, the authors hypothesize their roles in previously uncharacterized biological mechanisms, positioning SAE-based interpretability not merely as a tool for understanding models but as a tool for generating biological hypotheses.
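Linear probing here means fitting a simple linear model from SAE feature activations to a known property and inspecting which features carry the weight. A toy sketch with synthetic data: one feature is constructed to track a stand-in "thermostability" label, and a ridge-regression probe both predicts the label and points back at that feature. The feature index and label construction are fabricated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_feat = 400, 64  # hypothetical: proteins x SAE features

# SAE features are nonnegative; feature 7 secretly tracks the label.
F = np.abs(rng.normal(size=(n, d_feat)))
labels = (F[:, 7] > np.median(F[:, 7])).astype(float)

# Ridge-regression linear probe with an intercept column.
Fb = np.hstack([F, np.ones((n, 1))])
lam = 1e-3
W = np.linalg.solve(Fb.T @ Fb + lam * np.eye(d_feat + 1), Fb.T @ labels)

preds = (Fb @ W > 0.5).astype(float)
accuracy = float((preds == labels).mean())

# The probe's largest non-intercept weight identifies the feature
# that determines the property.
top_feature = int(np.argmax(np.abs(W[:d_feat])))
```

In the real setting the probe is fit on features from ESM-2 activations and labels from experimental annotations; the logic of "predict, then read off the responsible features" is the same.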

The implication is that the protein language model has learned, through unsupervised training on sequences alone, representations that capture genuine biological function. The SAE decomposes these representations into interpretable components that biologists can examine, validate, and use to guide experiments — turning mechanistic interpretability into mechanistic biology.

Patching Creativity in Music Generation

Facchiano et al. (2025) extend activation patching — another technique from the LLM interpretability toolkit — to music generation models. Activation patching works by replacing the activations at specific positions and layers during inference with activations from a different input, allowing researchers to identify which components of the network are causally responsible for particular output properties.

Applied to music, this approach enables interpretable steering of generated compositions: identifying which internal representations control tempo, instrumentation, harmonic structure, or genre characteristics, and then intervening on those representations to guide generation. The musical domain is particularly interesting for concept engineering because the relevant concepts (rhythm, melody, harmony) are well-defined by music theory but not easily expressed in text — making this a genuine test of whether concept steering works through modality-independent mechanisms rather than linguistic intermediation.
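The patching mechanic can be shown on a toy two-layer network. This is not the music model: the weights, inputs, and the "tempo units" are all hypothetical stand-ins. The point is the intervention itself, caching hidden activations from one run and overwriting selected units in another run to test whether they causally carry a property.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy two-layer network standing in for a generator.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, patch=None):
    """Run the model; if patch=(idx, values), overwrite those hidden
    units with cached activations from a different input."""
    h = np.tanh(x @ W1)
    if patch is not None:
        idx, values = patch
        h = h.copy()
        h[idx] = values  # the patching intervention
    return h @ W2, h

x_fast = rng.normal(size=8)  # stand-in: "fast tempo" prompt
x_slow = rng.normal(size=8)  # stand-in: "slow tempo" prompt

out_slow, h_slow = forward(x_slow)
_, h_fast = forward(x_fast)

# Patch the first 8 hidden units from the fast run into the slow run;
# if those units carry the property, the output shifts accordingly.
tempo_units = np.arange(8)
out_patched, _ = forward(x_slow, patch=(tempo_units, h_fast[tempo_units]))

effect = float(np.linalg.norm(out_patched - out_slow))
```

Sweeping the patched positions and layers and measuring `effect` is how patching localizes the components responsible for tempo, instrumentation, or harmony.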

The Convergence

The simultaneous success of concept engineering techniques across physics, biology, and music suggests something deeper than a collection of domain-specific results. Foundation models trained on sufficiently large and diverse datasets — whether the data consists of words, physical simulations, protein sequences, or musical scores — appear to converge on similar representational strategies. Concepts emerge as linear directions. These directions are steerable. And the steering transfers across contexts within the same domain.

This convergence validates the core premise of representation engineering: that the internal geometry of foundation models encodes meaningful structure, and that this structure can be read, interpreted, and manipulated. The fact that the same tools work across domains that share no surface similarity raises the possibility of a universal science of foundation model interpretability — one that applies wherever models learn rich representations from large-scale data, regardless of the modality.

The open question is whether this convergence reflects something fundamental about how neural networks organize information, or whether it reflects the specific training regimes and architectures that current foundation models share. Either way, the practical consequence is clear: the interpretability techniques developed for language models are not limited to language. They are tools for understanding any system that learns to represent the world.


References

  • Fear, R., Mukhopadhyay, P., McCabe, M., Bietti, A., & Cranmer, M. (2025). Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model. arXiv. [arXiv:2511.20798](https://arxiv.org/abs/2511.20798)
  • Adams, E., Bai, L., Lee, M., Yu, Y., & AlQuraishi, M. (2025). From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models. ICML 2025. DOI:10.1101/2025.02.06.636901
  • Facchiano, S. et al. (2025). Activation Patching for Interpretable Steering in Music Generation. arXiv.
