
GUI Agents: Why Architecture Beats Model Size for Browser Automation

Can an AI agent reliably browse the web on your behalf? Vardanyan (2025) finds that architectural choices (context management, tool design, and programmatic security) matter more than model size. The agent described in the paper achieves approximately 85% success on WebGames, a benchmark of 53 browser challenges, compared with approximately 50% for prior agents.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

An AI system that operates a web browser — clicking buttons, filling forms, navigating pages — sounds straightforward. The execution is not. Web pages are visually complex, dynamically rendered, inconsistently structured, and occasionally hostile.

Vardanyan (2025) approaches this from an engineering-first perspective, and the central finding is counterintuitive: architectural decisions determine agent success or failure more than the underlying LLM's capability.

Research Landscape: The GUI Agent Problem

Browser automation has existed for decades via scripted tools (Selenium, Playwright). What distinguishes current browser agents is their ability to handle unseen websites — requiring visual layout understanding, UI interpretation, and navigation decisions based on page state rather than pre-programmed rules.

Prior agents achieved approximately 50% success on the WebGames benchmark, a suite of 53 diverse browser challenges, while human performance sits at approximately 95.7%. The gap suggests raw model capability is not the primary bottleneck — if it were, scaling should close it. It has not.

The Architecture-First Approach

Vardanyan's system achieves approximately 85% success on WebGames — a substantial improvement over the approximately 50% baseline — through three architectural decisions rather than through larger models.

Hybrid Context Management

The first decision is how to represent the current state of a web page to the LLM. Pure screenshot-based approaches lose structural information (which element is clickable, what text is in which input field). Pure DOM-based approaches produce verbose, often incomprehensible context that quickly exceeds token limits.

The hybrid approach combines accessibility tree snapshots with selective vision. The accessibility tree — a browser's machine-readable representation of page structure, originally designed for screen readers — provides a compact, semantically meaningful summary of interactive elements. Selective vision supplements this with screenshot analysis only when the accessibility tree is insufficient (e.g., for CAPTCHAs, charts, or unconventional layouts).

This design reduces context length while preserving the information the agent actually needs for navigation decisions.
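
To make the idea concrete, here is a minimal sketch of a hybrid context builder. It is not the paper's implementation: the `AXNode` type, the role sets, and the "more than half unlabeled" fallback rule are all assumptions chosen for illustration. It shows the two halves of the design, a compact text summary of interactive elements and a cheap test for when a screenshot is worth its cost.

```python
from dataclasses import dataclass, field

# Roles treated as interactive in this sketch (an assumption, not the paper's list).
INTERACTIVE_ROLES = {"button", "link", "textbox", "combobox", "checkbox"}
# Roles whose content an accessibility tree usually cannot describe in text.
VISUAL_ROLES = {"img", "canvas", "figure"}


@dataclass
class AXNode:
    """Hypothetical, simplified accessibility-tree node."""
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)


def flatten(node: AXNode) -> list[AXNode]:
    """Return the node and all of its descendants in document order."""
    nodes = [node]
    for child in node.children:
        nodes.extend(flatten(child))
    return nodes


def summarize(root: AXNode) -> list[str]:
    """Keep only interactive elements, one compact line each."""
    lines = []
    for index, node in enumerate(flatten(root)):
        if node.role in INTERACTIVE_ROLES:
            lines.append(f'[{index}] {node.role} "{node.name}"')
    return lines


def needs_screenshot(root: AXNode) -> bool:
    """Fall back to vision when the tree alone is unlikely to be enough."""
    nodes = flatten(root)
    interactive = [n for n in nodes if n.role in INTERACTIVE_ROLES]
    unlabeled = [n for n in interactive if not n.name.strip()]
    has_visual = any(n.role in VISUAL_ROLES for n in nodes)
    # Assumed heuristic: visual-only content, no interactive elements at all,
    # or mostly unlabeled controls all suggest the text summary is insufficient.
    return has_visual or not interactive or len(unlabeled) > len(interactive) / 2
```

For a login page with a labeled text box and button, `summarize` yields two short lines and `needs_screenshot` stays false. A page whose only content is a `canvas` element triggers the vision path.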

Comprehensive Browser Tooling

The second decision is the agent's action space. Rather than limiting the agent to basic click-and-type operations, Vardanyan provides a comprehensive toolkit that includes scrolling, hovering, waiting for page loads, selecting from dropdowns, and executing JavaScript for edge cases. The toolkit is designed to match what a human can do with a browser — no more, no less.

The critical choice is making these tools programmatic rather than freeform — the agent selects from a defined set of actions with structured parameters, constraining the action space to reduce errors while covering the vast majority of browsing tasks.
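
One way to picture a structured action space is a schema that every model-proposed action must pass before it reaches the browser. The sketch below is an assumption-laden illustration, not the paper's toolkit: the tool names, parameter names, and the `parse_action` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical tool schema: action name -> required parameter names and types.
TOOL_SCHEMA: dict[str, dict[str, type]] = {
    "click": {"element_id": int},
    "type_text": {"element_id": int, "text": str},
    "scroll": {"delta_y": int},
    "select_option": {"element_id": int, "value": str},
    "wait_for_load": {"timeout_ms": int},
}


@dataclass(frozen=True)
class Action:
    name: str
    params: dict[str, Any]


def parse_action(raw: dict[str, Any]) -> Action:
    """Validate a model-proposed action against the schema, or raise ValueError."""
    name = raw.get("name")
    if name not in TOOL_SCHEMA:
        raise ValueError(f"unknown action: {name!r}")
    expected = TOOL_SCHEMA[name]
    params = raw.get("params", {})
    if not isinstance(params, dict) or set(params) != set(expected):
        raise ValueError(f"{name} expects parameters {sorted(expected)}")
    for key, expected_type in expected.items():
        if not isinstance(params[key], expected_type):
            raise ValueError(f"{name}.{key} must be {expected_type.__name__}")
    return Action(name, dict(params))
```

Anything outside the schema, such as an invented tool name or a string where an integer is required, is rejected before execution. The model then gets a precise error to correct instead of silently doing the wrong thing.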

Programmatic Security

The third and perhaps most practically important decision concerns security. The paper argues that safety boundaries should be enforced through code rather than LLM reasoning. This is a strong architectural claim: rather than instructing the LLM "do not visit malicious websites" or "do not enter credentials on phishing pages," the system implements hard constraints in the agent framework itself.

The reasoning is direct. Prompt injection attacks — where a web page contains text designed to hijack the agent's behavior — undermine any safety mechanism that relies on the LLM's judgment. If the LLM is responsible for deciding what is safe, an adversarial page can convince it that unsafe actions are safe. Programmatic constraints (URL allowlists, action rate limits, credential vaults with access controls) cannot be socially engineered.
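
As a rough illustration of what "enforced through code" can look like, the sketch below checks a URL allowlist and an action rate limit before any navigation happens. The host names, limits, and function names are assumptions made for the example. The paper names these categories of constraint but this is not its code.

```python
import time
from collections import deque
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # hypothetical allowlist


def is_allowed(url: str) -> bool:
    """Permit only HTTPS URLs whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


class RateLimiter:
    """Sliding-window limit on how many actions may run per time window."""

    def __init__(self, max_actions: int, window_seconds: float, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._stamps: deque[float] = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False
        self._stamps.append(now)
        return True


def guard_navigation(url: str, limiter: RateLimiter) -> None:
    """Raise PermissionError before the browser ever sees a disallowed request."""
    if not is_allowed(url):
        raise PermissionError(f"navigation blocked: {url}")
    if not limiter.allow():
        raise PermissionError("action rate limit exceeded")
```

The check compares the parsed host, not a substring of the URL, so look-alikes such as `https://example.com.evil.test/` or `https://example.com@evil.test/` are rejected. No text on a page can change the outcome, because the model's output is only ever an input to `guard_navigation`, never the judge.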

Critical Analysis: Claims and Evidence

Claim	Source	Assessment
~85% success on WebGames (53 challenges)	Benchmark evaluation	Supported; represents substantial gain over ~50% prior baseline
Architecture matters more than model size	Comparative analysis	Supported for the tested benchmark; may not generalize to all task types
Prompt injection makes general-purpose autonomous browsing fundamentally unsafe	Security analysis	Strongly argued; the attack vector is well-documented in the security literature
Programmatic security is more robust than LLM-based judgment	Architectural argument	Supported by reasoning; empirical evaluation of adversarial robustness not comprehensive

The 85% Ceiling

The approximately 85% figure leaves a gap of roughly 10 points to human performance (approximately 95.7%), and the agent still fails about 15% of the challenges. Understanding what those failures consist of matters: are they perception errors, planning errors, or execution errors? The paper discusses failure modes but does not systematically categorize them. Additionally, WebGames is a curated benchmark; real-world browsing involves ads, pop-ups, cookie consent banners, and A/B test variations that may not be fully represented.

The Security Argument

The security section challenges a common assumption: that agent safety can be achieved through better prompting or RLHF. Vardanyan argues this is architecturally incorrect for browser agents. The reasoning is concise: browser agents operate in an adversarial environment (the open web), adversaries control the content agents perceive (via web pages), and LLM-based safety reasoning is susceptible to adversarial content (prompt injection). Therefore, safety must be enforced by a mechanism adversaries cannot influence — namely, code.

This is an instance of a broader principle: security should not depend on the component being secured. An LLM that is both agent and safety monitor is a guard that can be bribed.
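
The same principle can be sketched for credentials. In the hypothetical vault below, the model only ever refers to an opaque handle, and the framework releases the secret solely when the current page matches the origin the secret was bound to. The class, method names, and origin-matching rule are assumptions for illustration, not details taken from the paper.

```python
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Scheme, host, and explicit port of a URL; any userinfo part is ignored."""
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{parts.hostname or ''}{port}"


class CredentialVault:
    """Secrets bound to one origin; the model only handles opaque handles."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, str]] = {}

    def store(self, handle: str, origin: str, secret: str) -> None:
        self._entries[handle] = (origin, secret)

    def release(self, handle: str, page_url: str) -> str:
        """Return the secret only if the current page matches the bound origin."""
        if handle not in self._entries:
            raise PermissionError(f"unknown credential handle: {handle!r}")
        origin, secret = self._entries[handle]
        if origin_of(page_url) != origin:
            raise PermissionError("credential is not bound to this origin")
        return secret
```

Because the secret never enters the model's context, a phishing page that persuades the model to "log in" gains nothing. The release decision depends on the page URL reported by the browser, which injected text cannot rewrite.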

Open Questions

Generalization beyond WebGames: How does the architecture perform on real-world tasks with dynamic content and authentication flows?

Context window scaling: As LLM context windows grow to 1M+ tokens, does the hybrid context approach remain advantageous, or does brute-force context inclusion become viable?

Accessibility tree reliability: On poorly-structured pages (no ARIA labels, JavaScript-heavy rendering), does the approach degrade?

Adversarial evaluation: What percentage of known prompt injection techniques bypass the programmatic constraints?

What This Means for Practitioners

For teams building browser automation, invest in architecture before model capability. Use accessibility trees as the primary page representation, supplement with vision selectively, provide structured action primitives, and implement security constraints in the framework rather than the prompt. The broader principle — that scaffolding matters more than the foundation model for structured tasks — is likely to hold across GUI automation.

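Disclaimer: This post is a research overview provided for informational purposes. Before citing it in academic work, verify the specific findings, statistics, and claims against the original paper.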


References (1)

[1] Vardanyan, A. (2025). Building Browser Agents: Architecture, Security, and Practical Solutions. arXiv:2511.19477.
