
Circuit Tracing: Anthropic Makes LLM Thinking Visible

Anthropic's circuit tracing produces computational graphs showing how language models transform inputs into outputs. The method reveals multi-hop reasoning pathways, poetry pre-selection mechanisms, and medical diagnosis representations inside Claude 3.5 Haiku — a concrete step toward making black-box models legible.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Large language models produce text that is often impressive and occasionally wrong, but in both cases the process that produced the output remains opaque. We observe the input and the output; the computation in between is a black box of billions of parameters interacting across dozens of layers. Anthropic's circuit tracing work attempts to open that box — not metaphorically, but literally, by producing computational graphs that trace how specific inputs are transformed, step by step, into specific outputs. The results do not explain everything. But they reveal enough to challenge the assumption that understanding these models is a lost cause.

The Research Landscape

The Problem: Polysemanticity and Superposition

Before circuit tracing makes sense, the obstacle it overcomes needs to be clear. Individual neurons in a language model do not represent clean concepts. A single neuron might activate for French text, discussions of cooking, and the color blue, a phenomenon called polysemanticity. The usual explanation is superposition: the model has to represent more concepts than it has neurons, so each concept is spread across many neurons and each neuron ends up participating in many concepts. That makes it nearly impossible to trace information flow by following individual neurons. Earlier interpretability work used sparse autoencoders to decompose activations into sparsely active, more interpretable "features." Circuit tracing builds on this foundation but goes further: it maps how those features influence one another on the way to an output.
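A toy illustration may help here. The sketch below is not taken from the paper: it packs three hypothetical concepts into two made-up neurons, so that each neuron responds to more than one concept, and then shows that reading out along the concept directions (a crude stand-in for what a sparse autoencoder learns) still identifies which concept is active. All names and numbers are assumptions chosen for illustration.

```python
import math

# Three hypothetical concepts packed into two neurons. Each concept gets a
# unit direction in neuron space, spaced 120 degrees apart (an assumption
# made purely for illustration).
CONCEPTS = ["french_text", "cooking", "color_blue"]
DIRECTIONS = {
    name: (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i, name in enumerate(CONCEPTS)
}


def neuron_activations(active):
    """Sum the directions of the active concepts to get the two neuron values."""
    n0 = sum(DIRECTIONS[c][0] for c in active)
    n1 = sum(DIRECTIONS[c][1] for c in active)
    return (n0, n1)


def decode(neurons):
    """Project the neuron vector onto every concept direction."""
    return {
        name: neurons[0] * d[0] + neurons[1] * d[1]
        for name, d in DIRECTIONS.items()
    }


def strongest_concept(neurons):
    """Return the concept whose direction best matches the neuron vector."""
    scores = decode(neurons)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for concept in CONCEPTS:
        neurons = neuron_activations([concept])
        # Neuron 0 is nonzero for every concept, so it is polysemantic,
        # yet the readout still picks out the concept that was active.
        print(concept, neurons, strongest_concept(neurons))
```

The readout only works cleanly here because one concept is active at a time. That sparsity assumption is also what sparse dictionary methods rely on.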

The Method: Attribution Graphs via Cross-Layer Transcoders

Ameisen, Lindsey, Pearce, Gurnee, and collaborators at Anthropic introduce attribution graphs — directed graphs where nodes represent active features, token embeddings, reconstruction errors, and output logits, while edges represent linear effects between nodes. The key methodological innovation is the use of a cross-layer transcoder — a replacement component that substitutes for parts of the model's multi-layer perceptrons.
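The graph structure described above can be sketched as a small data type. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the node labels, and the edge weights are all hypothetical, and only the four node kinds and the idea of weighted directed edges come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    EMBEDDING = "token embedding"
    FEATURE = "active feature"
    ERROR = "reconstruction error"
    LOGIT = "output logit"


@dataclass(frozen=True)
class Node:
    name: str
    kind: NodeKind


@dataclass
class AttributionGraph:
    # edges[(source, target)] holds the linear effect of source on target.
    edges: dict = field(default_factory=dict)

    def add_edge(self, source, target, weight):
        self.edges[(source, target)] = weight

    def incoming(self, target):
        """Return (source, weight) pairs feeding target, strongest first."""
        pairs = [(s, w) for (s, t), w in self.edges.items() if t == target]
        return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


def build_toy_graph():
    """Build a four-node graph with made-up weights for illustration."""
    token = Node("input token", NodeKind.EMBEDDING)
    feature = Node("feature A", NodeKind.FEATURE)
    error = Node("layer-3 error", NodeKind.ERROR)
    logit = Node("answer logit", NodeKind.LOGIT)
    graph = AttributionGraph()
    graph.add_edge(token, feature, 0.7)
    graph.add_edge(feature, logit, 0.9)
    graph.add_edge(error, logit, 0.2)
    graph.add_edge(token, logit, -0.1)
    return graph, logit
```

Calling `graph.incoming(logit)` on the toy graph lists the logit's inputs ordered by absolute effect. Keeping error nodes in the same ranking makes it visible when unexplained computation, rather than an interpretable feature, is driving an output.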

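The cross-layer transcoder addresses polysemanticity by standing in for the model's MLP neurons with sparsely active, more interpretable features. Each feature reads from the residual stream at one layer and can write to the MLP outputs of that layer and every later one, which is what makes it "cross-layer." Rather than asking "what does neuron 47 in layer 12 mean?", the method asks "what interpretable features are active at this point in the computation, and how do they influence downstream features?"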
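To make the "cross-layer" idea concrete, here is a minimal sketch, assuming a feature that reads from the residual stream at one layer and writes to the MLP outputs of that layer and later ones. It uses a plain ReLU and tiny hand-picked vectors. The class name, function names, and all numbers are hypothetical, and the real transcoder is trained rather than written by hand.

```python
from dataclasses import dataclass


def relu(x):
    return max(0.0, x)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


@dataclass
class CrossLayerFeature:
    read_layer: int   # layer whose residual stream this feature reads
    encoder: list     # read direction in that residual stream
    decoders: dict    # layer index -> vector written to that layer's MLP output

    def activation(self, residual):
        """Scalar activation: linear read followed by a ReLU."""
        return relu(dot(self.encoder, residual))


def reconstruct_mlp_outputs(features, residuals, num_layers, width):
    """Sum every feature's contribution into each layer it writes to."""
    outputs = [[0.0] * width for _ in range(num_layers)]
    for f in features:
        act = f.activation(residuals[f.read_layer])
        for layer, vec in f.decoders.items():
            if layer < f.read_layer:
                raise ValueError("a feature may only write to its own or later layers")
            for i, v in enumerate(vec):
                outputs[layer][i] += act * v
    return outputs


def toy_example():
    """Two layers of width 2, with one cross-layer feature and one local feature."""
    features = [
        CrossLayerFeature(0, [1.0, 0.0], {0: [0.5, 0.0], 1: [0.0, 2.0]}),
        CrossLayerFeature(1, [0.0, 1.0], {1: [1.0, 0.0]}),
    ]
    residuals = [[2.0, -1.0], [0.0, 3.0]]
    return reconstruct_mlp_outputs(features, residuals, num_layers=2, width=2)
```

In the toy example the first feature activates at layer 0 and contributes to the outputs of both layers. A per-layer dictionary cannot express that without duplicating the feature.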

The resulting attribution graph traces the chain of intermediate steps that the model uses to transform a specific input prompt into an output response. Crucially, these graphs are prompt-specific — they show how the model processes this particular input, not how the model works in general. This is both a strength (concrete, verifiable) and a limitation (does not yield universal rules).

What the Graphs Reveal

The applied results are where circuit tracing becomes concrete. The researchers apply the method to Claude 3.5 Haiku, with the case studies reported in a companion paper, "On the Biology of a Large Language Model," and find several distinct computational patterns:

Multi-hop reasoning pathways: When the model answers a question that requires multi-step inference, the intermediate step shows up in the graph. The researchers' running example is the prompt "the capital of the state containing Dallas is," which requires inferring that Dallas is in Texas and then recalling that the capital of Texas is Austin. The attribution graph shows distinct features activating in sequence: the "Dallas" token activates a "Texas" feature, which combines with a "capital" feature to promote "Austin" at the output. The hops are visible as distinct paths in the graph.
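One way to read hops off a graph like this is to enumerate input-to-output paths and rank them by strength. The sketch below assumes the graph is acyclic and that a path's strength is the product of its edge weights, which is how indirect effects compose in a purely linear graph. The node names and weights are invented for illustration and are not values from the paper.

```python
def all_paths(edges, source, target, prefix=None):
    """Yield every path from source to target in an acyclic graph."""
    prefix = (prefix or []) + [source]
    if source == target:
        yield prefix
        return
    for nxt in edges.get(source, {}):
        yield from all_paths(edges, nxt, target, prefix)


def path_strength(edges, path):
    """Multiply edge weights along the path (the linear indirect effect)."""
    strength = 1.0
    for a, b in zip(path, path[1:]):
        strength *= edges[a][b]
    return strength


def ranked_paths(edges, source, target):
    """Return all paths from source to target, strongest first."""
    paths = list(all_paths(edges, source, target))
    return sorted(paths, key=lambda p: abs(path_strength(edges, p)), reverse=True)


# Hypothetical two-hop graph: a direct shortcut competes with a path that
# passes through an intermediate fact.
TOY_EDGES = {
    "entity token": {"intermediate fact": 0.8, "answer logit": 0.1},
    "intermediate fact": {"answer logit": 0.9},
}

if __name__ == "__main__":
    for path in ranked_paths(TOY_EDGES, "entity token", "answer logit"):
        print(" -> ".join(path), path_strength(TOY_EDGES, path))
```

In this toy graph the two-hop path outranks the shortcut. In the multi-hop example, that pattern would indicate the model computing the intermediate fact rather than memorizing the answer directly.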

Poetry pre-selection: When generating rhyming text, the model does not simply produce words left-to-right and hope they rhyme. The attribution graph shows that features corresponding to rhyming words activate before the model has reached the position where the rhyming word will be produced. The model pre-selects the endpoint and works backward — a form of planning that was theorized but not previously observed at this level of detail.

Medical diagnosis representations: When the model processes a clinical vignette, the attribution graph shows features that correspond to symptoms, differential diagnoses, and ruling-out logic. These features interact in patterns that resemble (but are not identical to) the clinical reasoning taught in medical schools. The model has learned something like a diagnostic process from its training data.

Open-Sourced Tools

Anthropic has released the attribution graph tools and applied them to open-source models including Gemma-2 and Llama-3.2. This is a deliberate choice to make the method reproducible and to invite external verification. The open-sourcing matters for the field: interpretability claims are only as credible as the community's ability to replicate them.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Attribution graphs reveal interpretable computational structure in LLMs	Demonstrated on Claude 3.5 Haiku with specific examples	✅ Supported — examples are concrete and verifiable
Multi-hop reasoning follows distinct sequential feature activation	Attribution graph visualization of multi-hop queries	✅ Supported — visible in published graphs
Poetry generation involves pre-selection of rhyming endpoints	Feature activation timing analysis	✅ Supported — novel finding with clear mechanism
The method generalizes to open-source models (Gemma-2, Llama-3.2)	Tools released and applied to these models	✅ Supported — code available
Circuit tracing provides a complete account of model behavior	Not claimed; acknowledged as partial	⚠️ Explicitly acknowledged as incomplete

What Circuit Tracing Does Not Do

The method has clear limitations that the authors acknowledge. Attribution graphs are local: they explain one computation on one input, not the model's general behavior. The cross-layer transcoder is an approximation of the original model; whatever it fails to reconstruct is pushed into error nodes, and interactions that do not decompose cleanly into linear effects may be missed. Interpretation of individual features also still relies on human judgment: a feature labeled "capital city" carries that label because researchers inspected what activates it, and that labeling process is subjective.

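There is also a selection effect in the published examples. The multi-hop reasoning and poetry cases are ones where the method works well, and the authors say as much: by their own account, attribution graphs gave satisfying insight for only about a quarter of the prompts they tried. How often the graphs are actively misleading, rather than merely uninformative, is not quantified.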

Open Questions and Future Directions

Scaling to larger models: Claude 3.5 Haiku is Anthropic's lightweight model. Can circuit tracing handle larger models where active features per prompt are far more numerous?

From description to intervention: Attribution graphs describe what the model does. Can they guide targeted edits — suppressing a specific reasoning pathway or strengthening a desired one?
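What an intervention means can be shown on a toy linear graph, with the caveat that a real model is not linear end to end and this is only a sketch under that assumption. The function below recomputes node values in topological order and lets the caller clamp a chosen feature, for example to zero. The graph, node names, and weights are hypothetical.

```python
def propagate(edges, order, inputs, clamp=None):
    """Compute node values in topological order.

    edges maps (source, target) to a weight, inputs fixes the source nodes,
    and clamp overrides the computed value of any node it names.
    """
    clamp = clamp or {}
    values = {}
    for node in order:
        if node in clamp:
            values[node] = clamp[node]
        elif node in inputs:
            values[node] = inputs[node]
        else:
            values[node] = sum(
                values[src] * w for (src, dst), w in edges.items() if dst == node
            )
    return values


# Hypothetical graph: feature_a pushes the logit up, feature_b pushes it down.
TOY_EDGES = {
    ("prompt", "feature_a"): 1.0,
    ("prompt", "feature_b"): 0.5,
    ("feature_a", "logit"): 2.0,
    ("feature_b", "logit"): -1.0,
}
ORDER = ["prompt", "feature_a", "feature_b", "logit"]

baseline = propagate(TOY_EDGES, ORDER, {"prompt": 1.0})
ablated = propagate(TOY_EDGES, ORDER, {"prompt": 1.0}, clamp={"feature_a": 0.0})
```

Comparing `baseline["logit"]` with `ablated["logit"]` gives the effect of suppressing `feature_a`. In a real model the graph's prediction would still have to be checked against an actual forward pass with the feature suppressed, because the graph only approximates the model.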

Automated interpretation: Currently, human researchers label features and interpret graphs. Can this process be automated, perhaps by using another LLM to annotate features?

Safety applications: If circuit tracing can reveal deceptive reasoning, it could become a safety tool. But adversarial robustness of the interpretability method itself has not been tested.

What This Means for Your Research

If you work on model interpretability, circuit tracing represents a methodological advance worth engaging with. The open-sourced tools on Gemma-2 and Llama-3.2 provide a concrete starting point for replication and extension.

If you work on AI safety, the gap between "we can sometimes see what the model is doing" and "we can reliably detect dangerous behavior" remains large. Circuit tracing is a step, not a solution — but it is a concrete step with open-source tools you can start using today.


Disclaimer: This post is a research review intended for informational purposes. Before citing it in academic work, verify specific findings, statistics, and claims against the original papers.

References

[1] Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W. et al. (2025). Circuit Tracing: Revealing Computational Graphs in Language Models. Anthropic, transformer-circuits.pub. https://transformer-circuits.pub/2025/attribution-graphs/methods.html.
