Do different neural networks learn the same concepts? If a vision model trained by Google and one trained by Meta both recognize "the face of a dog in a crowd," do they represent that concept in comparable ways? The answer has profound implications: if concepts are universal across architectures, we can build interpretability tools that transfer between models rather than starting from scratch for each one. A line of work published in 2025 suggests that the answer is largely yes, and provides the tools to test it.
The Universality Hypothesis
The idea that different neural networks converge on similar internal representations is not new. Research on "Rosetta Neurons" has found correlated activations across models, and theoretical arguments suggest that foundation models trained on sufficiently large datasets should converge toward a shared "Platonic representation" of the world. But demonstrating this empirically — in a way that produces actionable, interpretable concept alignments — has remained challenging.
Existing methods for comparing model representations operate post-hoc: train a separate sparse autoencoder on each model, extract features independently, then try to match them through expensive filtering or optimization. This pairwise matching scales poorly as the number of models and features grows, and it fundamentally cannot discover concepts that emerge only through joint analysis.
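To make the cost concrete, here is a minimal sketch of the post-hoc baseline: two independently trained SAEs are compared by correlating their feature activations over a shared input set. This is an illustrative toy, not any specific published pipeline; the function name and toy data are hypothetical.

```python
import numpy as np

def match_features_posthoc(acts_a, acts_b):
    """Match features from two independently trained SAEs by correlating
    their activation patterns over a shared set of inputs.

    acts_a: (n_samples, n_feats_a) sparse codes from model A's SAE
    acts_b: (n_samples, n_feats_b) sparse codes from model B's SAE
    Returns, for each feature of A, its best match in B and the score.
    """
    # Standardize each feature column across samples.
    za = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    zb = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    # Pearson correlation between every feature pair: cost grows with
    # n_feats_a * n_feats_b, and again with every pair of models compared.
    corr = za.T @ zb / len(za)
    return corr.argmax(axis=1), corr.max(axis=1)

# Toy demo: two SAE codes driven by the same hidden "concepts".
rng = np.random.default_rng(0)
shared = rng.random((256, 8))
acts_a = np.maximum(shared @ rng.random((8, 16)) - 1.0, 0)
acts_b = np.maximum(shared @ rng.random((8, 16)) - 1.0, 0)
best, score = match_features_posthoc(acts_a, acts_b)
```

The quadratic blow-up in model pairs, plus the need to re-run matching for every new model, is exactly what joint training is meant to avoid.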
Universal Sparse Autoencoders
In work presented at ICML 2025, Thasarathan, Forsyth, Fel, Kowal, and Derpanis (2025) take a different approach. Their Universal Sparse Autoencoder (USAE) is trained simultaneously on the activations of multiple pretrained deep neural networks, learning a single shared concept space that can encode and reconstruct any model's internal representations.
The architecture uses model-specific encoders that map each model's activations into a shared sparse code Z, which is then decoded back to each model's activation space through model-specific decoders. By optimizing a shared objective across models, the USAE forces concept alignment during training rather than discovering it after the fact.
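The encode-from-one, decode-to-all structure can be sketched as follows. This is a minimal illustration of the idea rather than the authors' implementation: the top-k sparsity, layer widths, and training loop are assumptions chosen for brevity.

```python
import random
import torch
import torch.nn as nn

class USAE(nn.Module):
    """Minimal Universal Sparse Autoencoder sketch: per-model encoders and
    decoders wrapped around one shared sparse concept code Z."""

    def __init__(self, model_dims, n_concepts=512, k=16):
        super().__init__()
        self.k = k  # concepts allowed to fire per sample (top-k sparsity)
        self.encoders = nn.ModuleList(nn.Linear(d, n_concepts) for d in model_dims)
        self.decoders = nn.ModuleList(nn.Linear(n_concepts, d) for d in model_dims)

    def encode(self, x, m):
        z = torch.relu(self.encoders[m](x))
        # Keep only each sample's top-k concept activations; zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        return torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)

    def forward(self, x, m):
        """Encode model m's activations into the shared code, then
        reconstruct EVERY model's activation space from that one code."""
        z = self.encode(x, m)
        return z, [dec(z) for dec in self.decoders]

# One training step: activations from all models for the same inputs,
# encoded through a randomly chosen model, decoded back to all of them.
dims = [384, 768, 1024]                     # hypothetical backbone widths
usae = USAE(dims)
acts = [torch.randn(32, d) for d in dims]   # stand-ins for real activations
m = random.randrange(len(dims))
z, recons = usae(acts[m], m)
loss = sum(nn.functional.mse_loss(r, a) for r, a in zip(recons, acts))
loss.backward()
```

Because every decoder must reconstruct its model's activations from a code produced by any other model's encoder, the shared code is pushed toward concepts all models can express.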
Applied to three diverse vision models — DinoV2, SigLIP, and ViT — the system discovers semantically coherent universal concepts ranging from low-level features (colors, textures, curves) to high-level structures (animal haunches, human faces in crowds). The analysis reveals a strong correlation between concept universality and importance: concepts that appear across all models tend to be the ones that matter most for downstream tasks.
A novel application called Coordinated Activation Maximization enables simultaneous visualization of how different models encode the same concept — revealing both the shared structure and the model-specific variations in concept representation. The study also provides evidence that DinoV2 admits unique features not found in other models, while universal training uncovers shared representations that model-specific SAE training misses entirely.
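The core of coordinated activation maximization can be sketched as a joint optimization: one input per model, all optimized so that each model's USAE-encoded activations fire on the same shared concept. The function below is a hypothetical simplification (real feature visualization adds regularizers and augmentations), and the tiny stand-in models exist only so the sketch runs.

```python
import torch

def coordinated_activation_max(models, encoders, concept, steps=50, lr=0.05):
    """Jointly optimize one image per model so that every model's
    activations, passed through its USAE encoder, activate the same
    shared concept. Images are sized for the toy models below."""
    images = [torch.randn(1, 3, 32, 32, requires_grad=True) for _ in models]
    opt = torch.optim.Adam(images, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Maximizing the summed concept activation pushes each image toward
        # that model's own rendering of the shared concept.
        loss = -sum(enc(model(img))[0, concept]
                    for model, enc, img in zip(models, encoders, images))
        loss.backward()
        opt.step()
    return images

# Tiny stand-ins for real backbones and USAE encoders (illustration only).
torch.manual_seed(0)
models = [torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 32 * 32, 64))
          for _ in range(2)]
encoders = [torch.nn.Linear(64, 128) for _ in range(2)]
imgs = coordinated_activation_max(models, encoders, concept=7, steps=5)
```

Comparing the resulting images side by side is what exposes both the shared structure and the model-specific variation the section describes.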
Cross-Model and Cross-Modal Alignment
Complementary work extends concept alignment beyond models of the same type. Nasiri-Sarvi, Rivaz, and Hosseini (2025) introduce SPARC, a framework for concept-aligned sparse autoencoders that works across both different models and different modalities. Where USAEs align vision models with each other, SPARC aims to align representations across vision and language models — a capability with implications for multimodal AI systems where interpretability must bridge the gap between how an image encoder and a language decoder represent the same concept.
Puri, Berend, and Lapuschkin (2025) approach the same problem from a different angle with Atlas-Alignment, a method for making interpretability findings transferable across language models. Once a researcher identifies an interesting feature circuit in one LLM, Atlas-Alignment enables mapping that finding to corresponding features in a different LLM — potentially reducing the interpretability effort from per-model to per-concept.
Why Universal Concepts Matter
The practical implications extend beyond academic interpretability research. If concepts are genuinely universal across architectures, several capabilities become possible:
Transferable safety audits. Safety-relevant features identified in one model (deceptive reasoning, harmful knowledge) can be automatically located in other models, dramatically reducing the cost of safety evaluation across the model ecosystem.
Efficient model comparison. Rather than evaluating models through behavioral benchmarks alone, researchers can directly compare which concepts each model has learned and how prominently they feature in the model's computations.
Cross-model steering. Intervention techniques developed for one model — activation steering vectors, concept erasure methods — can potentially be transferred to other models through the shared concept space, creating reusable safety toolkits.
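As a hypothetical illustration of the steering idea: with a USAE-style linear decoder, a shared concept's direction in each model's activation space is simply that model's decoder column for the concept, so the same concept index yields a steering vector for every model. The decoder widths below are made up.

```python
import torch

def steering_vector(decoder: torch.nn.Linear, concept: int) -> torch.Tensor:
    """A shared concept's direction in one model's activation space is the
    corresponding column of that model's linear USAE decoder."""
    return decoder.weight[:, concept].detach()

# Hypothetical USAE decoders for two models with different widths,
# both reading from a 512-concept shared code.
dec_a = torch.nn.Linear(512, 384)
dec_b = torch.nn.Linear(512, 768)
v_a = steering_vector(dec_a, concept=42)   # concept 42 in model A's space
v_b = steering_vector(dec_b, concept=42)   # the same concept in model B
# Steering model A: nudge its activations along the concept direction.
steered = torch.randn(384) + 4.0 * v_a
```

The reusable object here is the concept index, not the vector: locate a concept once in the shared space, and each model's decoder translates it into that model's coordinates.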
Open Questions
The universality hypothesis, while supported by growing evidence, has important caveats. Not all concepts are universal — DinoV2's unique features demonstrate that training choices create model-specific representations alongside shared ones. The current evidence comes primarily from vision models; whether language models exhibit the same degree of concept universality remains an open empirical question, though early results from cross-LLM alignment work are encouraging.
There is also a measurement question: are the concepts that USAEs discover genuinely shared, or are they artifacts of the shared training objective? Distinguishing genuine universality from imposed alignment requires careful experimental design.
Looking Forward
The convergence of universal SAEs, cross-modal alignment, and transferable interpretability tools points toward a future where understanding one model tells you something meaningful about all models. This would transform interpretability from a per-model expense into a cumulative science — where each insight builds on previous ones rather than starting fresh with every new architecture.
References
Thasarathan, H., Forsyth, J., Fel, T., Kowal, M., & Derpanis, K. G. (2025). Universal sparse autoencoders: Interpretable cross-model concept alignment. Proceedings of the 42nd International Conference on Machine Learning (ICML). arXiv:2502.03714.
Nasiri-Sarvi, A., Rivaz, H., & Hosseini, M. S. (2025). SPARC: Concept-aligned sparse autoencoders for cross-model and cross-modal interpretability. arXiv preprint, arXiv:2507.06265.