
🤖 AI & Machine Learning

57 articles in AI & Machine Learning

Paper Review
Google's Gemini 2.5 Pro introduces a 'thinking budget' that gives users direct control over how much computation a model spends reasoning. We examine what this means for the quality-cost-latency triangle and whether user-controlled inference scaling changes the economics of AI deployment.
Gemini · thinking budget · reasoning
Critical Review
RLHF, Constitutional AI, and inverse reinforcement learning are widely treated as alignment solutions. A philosophical analysis argues they are something more modest: safety measures that cannot, in principle, produce robust alignment under capability scaling. The distinction matters more than it might seem.
alignment · RLHF · safety
Deep Dive
Anthropic's circuit tracing produces computational graphs showing how language models transform inputs into outputs. The method reveals multi-hop reasoning pathways, poetry pre-selection mechanisms, and medical diagnosis representations inside Claude 3.5 Haiku — a concrete step toward making black-box models legible.
circuit tracing · interpretability · Anthropic
Paper Review
The scaling laws that underpin modern LLM training assume clean data. What happens when the data is contaminated with AI-generated text? Two papers — one at ICLR 2025, one proposing a verification-based escape — show that even small fractions of synthetic data can break scaling and that verification offers a partial but imperfect remedy.
model collapse · synthetic data · scaling laws
Deep Dive
Before acting in the physical world, an effective robot should be able to imagine the consequences. World models — internal simulators that predict how actions reshape future states — are becoming the central architecture for embodied AI. A comprehensive survey and a Meta/HKUST research agenda map the state of the art and the open problems.
world models · embodied AI · robotics
Critical Review
RLHF and Constitutional AI align language models by optimizing toward formal specifications — reward functions, constitutional principles, or preference representations — but Goodhart's Law, reward hacking, and specification gaming suggest that any content-based value alignment faces inherent structural limits as models scale.
RLHF · specification gaming · Goodhart's Law
Paper Review
Five years after AlphaFold2 solved protein structure prediction, the field's frontier has shifted to biomolecular complexes and binding affinity — where AlphaFold3, Boltz-1 (open-source), and Boltz-2 represent successive steps toward the drug discovery application that structural biology always promised.
AlphaFold · protein structure prediction · Boltz-1
Methodology Guide
EVA (Explained Variance Adaptation) replaces LoRA's random initialization with a data-driven scheme that captures the directions of highest variance in the model's activations — yielding consistent improvements across language, vision, and reinforcement learning tasks without increasing inference cost.
LoRA · EVA · parameter-efficient finetuning
Deep Dive
Multiagent Finetuning (MAFT) starts from a single base language model and produces multiple specialized agent copies that generate diverse reasoning chains — then uses inter-agent selection pressure to improve each agent beyond what single-model self-improvement can achieve, avoiding the collapse that plagues standard synthetic data training.
multiagent finetuning · self-improvement · diverse reasoning
Paper Review
HuggingFace's SmolVLM achieves competitive multimodal performance at 256M parameters by rethinking image tokenization and model architecture — demonstrating that small vision-language models can match or approach models 100x their size on key benchmarks, enabling deployment on phones, robots, and edge devices.
SmolVLM · vision-language model · small multimodal model
Critical Review
Context windows now stretch past one million tokens. Does that make retrieval-augmented generation obsolete? Two lines of research — GraphRAG and Agentic RAG — suggest the opposite: RAG is not dying, it is differentiating. We examine the evidence on both sides.
RAG · long context · GraphRAG
Methodology Guide
Speculative decoding and quantization both accelerate LLM inference, but do they work well together? Zhang et al. find that naive combinations can degrade performance, and propose a hierarchical framework achieving 2.78x speedup on quantized Llama-3-70B.
speculative decoding · quantization · LLM inference
Methodology Guide
Mixture-of-Experts has become the default LLM architecture in 2025, with models like DeepSeek-R1, Kimi K2, and Mistral Large adopting it. We examine how DeepSeekMoE's expert specialization strategies shaped this trend and what design choices make MoE work at scale.
MoE · mixture of experts · LLM architecture
Methodology Guide
Can an AI agent reliably browse the web on your behalf? Vardanyan (2025) finds that architectural choices — context management, tool design, and programmatic security — matter more than model size. The agent achieves roughly 85% on the 53-challenge WebGames benchmark, compared with roughly 50% for prior agents.
browser agents · GUI automation · computer use
Deep Dive
Standard GraphRAG constrains knowledge to binary relations — one edge connecting two entities. HyperGraphRAG extends this to n-ary hyperedges, connecting multiple entities in a single relation. Experiments across medicine, agriculture, CS, and law show improvements over both standard RAG and GraphRAG.
HyperGraphRAG · RAG · knowledge graph
Trend Analysis
A new class of AI systems—deep research agents—can autonomously plan multi-step investigations, search across databases, and synthesize findings. With 71+ citations in months, this paradigm is reshaping how machines conduct scientific inquiry. We examine the architecture, evaluation gaps, and security risks.
deep research agents · autonomous AI · LLM agents
Paper Review
Graph-based RAG has exploded in popularity, but when does it actually outperform standard vector retrieval? Two surveys and two new frameworks reveal the answer is more nuanced than the hype suggests.
GraphRAG · knowledge graph · RAG
Paper Review
Anthropic's Constitutional Classifiers represent a promising jailbreak defense—surviving thousands of hours of red teaming. But multi-turn attacks and autonomous red teamers are raising the stakes. We examine whether universal defense is achievable.
AI safety · jailbreak defense · constitutional AI
Deep Dive
RLHF has become the standard for aligning LLMs with human preferences—but reward models learn spurious shortcuts that produce fluent nonsense humans rate highly. Lambert's RLHF textbook and new causal reward methods reveal the depth of this alignment paradox.
RLHF · reward hacking · alignment
Paper Review
A Nature paper on vision-language foundation models for cancer diagnosis signals that multimodal medical AI has crossed from research curiosity to clinical necessity.
vision-language model · foundation model · medical AI
Trend Analysis
The most powerful language models require data centers. But 2025's compression breakthroughs—vector quantization, entropy coding, and KV cache optimization—are making billion-parameter models viable on edge devices. The implications for privacy, latency, and access are profound.
on-device LLM · edge inference · model compression
Trend Analysis
GAIA-2 introduces multi-view generative world models for autonomous driving, where diffusion models don't just generate video—they simulate physics. Combined with 4D consistency breakthroughs, this represents a new paradigm for self-driving simulation.
world models · autonomous driving · video diffusion
Paper Review
DeepSeek R1 proved that RL can unlock genuine reasoning in LLMs. Now the field is asking harder questions: how to maintain reasoning diversity, how to scale inference compute, and whether RL-trained reasoners actually understand or merely pattern-match.
LLM reasoning · reinforcement learning · DeepSeek R1
Paper Review
Goedel-Prover achieves state-of-the-art open-source theorem proving in Lean 4, while Aristotle and Seed-Prover reach IMO competition level. The convergence of LLMs and formal verification is creating machines that don't just calculate—they prove.
automated theorem proving · formal verification · Lean
Deep Dive
A widely discussed AI paper in Nature is not about language or images—it's about Earth. This foundation model learns to predict weather, climate, and extreme events from a unified representation of the planet's physical systems.
foundation model · earth system · climate AI
Paper Review
Multimodal LLMs that see images and generate text face safety risks that text-only alignment cannot address. Safe RLHF-V proposes decoupled optimization of helpfulness and safety—but a sociotechnical critique argues the entire RLHF paradigm has fundamental limits.
safe RLHF · multimodal safety · MLLM alignment
Methodology Guide
Most ML models give you a prediction but no reliable measure of how wrong it might be. Conformal prediction offers something remarkable: finite-sample coverage guarantees with no distributional assumptions. In 2025, the method is conquering its two remaining weaknesses—distribution shift and label corruption.
conformal prediction · uncertainty quantification · distribution shift
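The coverage guarantee this article refers to is easy to see concretely. Below is a minimal, illustrative sketch of split conformal prediction (not code from the article): compute nonconformity scores on a held-out calibration set, take a finite-sample-corrected quantile, and the resulting interval covers new points with probability at least 1 − α, regardless of the data distribution. The constant-zero predictor is a deliberately trivial stand-in for any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1  # target miscoverage rate (guarantee: coverage >= 1 - alpha)

# Calibration set: absolute residuals of some fitted model.
# Here a dummy predictor that always outputs 0, for illustration only.
y_cal = rng.normal(size=1000)
scores = np.abs(y_cal - 0.0)  # nonconformity scores

# Finite-sample-corrected quantile of the calibration scores.
n = len(scores)
level = np.ceil((n + 1) * (1 - alpha)) / n
q = np.quantile(scores, level, method="higher")

# Prediction interval for a new point x: [model(x) - q, model(x) + q].
# By exchangeability, coverage >= 1 - alpha with no distributional assumptions.
y_test = rng.normal(size=10_000)
coverage = np.mean(np.abs(y_test - 0.0) <= q)
print(f"interval half-width q = {q:.3f}, empirical coverage = {coverage:.3f}")
```

The distribution-free part is the point: nothing about the guarantee depends on the residuals being Gaussian, only on the calibration and test points being exchangeable — which is exactly the assumption that distribution shift breaks, and why the 2025 work on shift-robust variants matters.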
Paper Review
LLMs that pass single-turn safety tests fail catastrophically in multi-turn conversations. MTSA demonstrates dramatic safety degradation over extended dialogues, while MUSE uses Monte Carlo Tree Search to systematically discover multi-turn attack paths. The implications for deployed conversational AI are urgent.
red teaming · multi-turn attack · dialogue safety
Deep Dive
LLMs don't just reflect societal biases—they systematize and amplify them. New research quantifies bias in sentiment analysis, proposes stereotype neutralization at the representation level, and reveals that debiasing methods designed for English fail in Chinese cultural contexts.
AI bias · fairness · LLM discrimination
Paper Review
The Mixture of Experts architecture—where only a fraction of parameters activate per input—is expanding from language to multimodal domains. SkyMoE and RingMoGPT show how expert routing enables domain specialization without the cost of separate models.
mixture of experts · MoE · multimodal
Trend Analysis
Intelligent tutoring systems powered by LLMs can now diagnose knowledge gaps, generate adaptive learning paths, and provide real-time feedback. But do they actually improve learning—or just create an illusion of engagement? The evidence is more nuanced than EdTech marketing suggests.
AI education · intelligent tutoring · personalized learning
Deep Dive
What happens when AI agents get their own social network—and humans are merely spectators? Moltbook, the first platform designed exclusively for agent-to-agent interaction, has already produced emergent social behaviors that its creators did not anticipate. The implications extend far beyond novelty.
AI agents · social network · agent interaction
Paper Review
When you tell an AI which response you prefer, you reveal your values, beliefs, and vulnerabilities. RLHF systems aggregate millions of such preference signals—creating a privacy risk that the alignment community has barely acknowledged. User-level differential privacy offers a path forward, but at a cost.
RLHF privacy · differential privacy · user-level privacy
Paper Review
Biological networks—protein interactions, brain connectivity, metabolic pathways—are inherently hierarchical. Euclidean GNNs distort this hierarchy. Hyperbolic graph neural networks, operating in curved space, capture hierarchical structure with mathematical precision. The applications in drug discovery and neuroscience are already producing results.
hyperbolic geometry · graph neural network · drug-target interaction
Paper Review
Drug discovery takes 10-15 years and costs over $1 billion per approved compound. FROGENT deploys a multi-agent AI system that handles the entire pipeline—from target identification to molecular optimization—by coordinating specialized agents across fragmented computational tools.
drug discovery · multi-agent AI · agentic AI
Trend Analysis
General-purpose VLMs struggle with satellite imagery because they were trained on internet photos, not overhead perspectives. A new generation of remote sensing foundation models—RingMoGPT, SegEarth-R1—is bridging this gap with domain-adapted architectures.
remote sensing · vision-language model · satellite imagery
Paper Review
Financial fraud evolves faster than static detection models can adapt. FraudGNN-RL combines graph neural networks—which capture the relational structure of transactions—with reinforcement learning that adapts to emerging fraud patterns in real time.
graph neural network · fraud detection · reinforcement learning
Trend Analysis
A digital twin of a patient—a dynamic computational model updated with real-time health data—could enable drug testing on virtual patients before real ones. Ren et al.'s Alzheimer's application shows how this concept is moving from engineering metaphor to clinical tool.
digital twin · precision medicine · drug discovery
Paper Review
Speech BCIs that decode neural signals into language are advancing from English-only lab demos to real-time multilingual systems. Qian et al. demonstrate full-spectrum Chinese decoding, while Jude et al. restore communication to a locked-in patient. The clinical implications are immediate.
brain-computer interface · BCI · neural decoding
Paper Review
Generating dynamic 3D content—objects that move through space and time—typically requires expensive training on 3D datasets. Zero4D and WorldForge achieve this without any training, by guiding existing video diffusion models with geometric constraints. The implications for content creation and simulation are substantial.
4D generation · training-free · video diffusion
Trend Analysis
General-purpose LLMs reason well on benchmarks but struggle in domains that require specialized knowledge structures—patent law's IRAC methodology, medical differential diagnosis, or regulatory compliance. Domain-adapted reasoning models are filling this gap.
domain-specific LLM · legal reasoning · medical reasoning
Paper Review
Most AI agents execute tools one at a time—search, then read, then analyze—even when tasks could be parallelized. GAP models sub-task dependencies as a directed graph, enabling parallel tool execution that improves throughput without sacrificing correctness.
AI agents · parallel tool use · graph planning
Trend Analysis
Traditional perimeter-based security assumes a trusted inside and untrusted outside. Zero Trust assumes nothing is trusted—every access request must be verified. AI-powered intrusion detection within Zero Trust Architecture is emerging as the standard for industrial IoT and cloud security.
zero trust · intrusion detection · cybersecurity
Paper Review
AI can write code faster than humans—but can it prove that code is correct? PatchPilot combines AI patching agents with formal verification, while FVAPPS benchmarks the emerging capability of AI to generate both code and correctness proofs.
formal verification · code generation · software testing
Field Map
Where did we come from, and where are we going? LLMOrbit maps the full landscape of large language models from 2019 to 2025 as a circular taxonomy—revealing that the field has hit scaling walls and is pivoting toward agentic architectures as the next growth vector.
LLM taxonomy · language model evolution · scaling laws
Paper Review
The standard recipe for building a reasoning LLM involves supervised fine-tuning on curated chain-of-thought data before applying reinforcement learning. DeepSeek-R1 asks: what if you skip the supervised step entirely? The answer—that self-reflection, verification, and dynamic strategy adaptation emerge spontaneously from RL alone—challenges assumptions about how reasoning develops in language models.
DeepSeek · reinforcement learning · reasoning
Critical Review
AI coding agents solve 43.6% of public benchmark tasks—but how do they fare on real enterprise codebases? SWE-Bench Pro reveals that performance drops steeply when agents face long-horizon, multi-file engineering tasks drawn from commercial repositories, exposing a significant gap between benchmark scores and practical capability.
SWE-Bench · coding agents · software engineering
Critical Review
Multi-agent debate has been promoted as a way to improve LLM reasoning through deliberation—but does it actually help? Eo et al. (2025) show that debate often hurts performance and propose DOWN, a framework that debates only when necessary, achieving up to 6x efficiency gains.
multi-agent · debate · LLM collaboration
Deep Dive
A persistent worry in RL-trained reasoning models: if you only reward the final answer, won't the model learn to reach correct answers through flawed reasoning? A new theoretical result shows that under specific conditions, GRPO with binary verifiable rewards implicitly amplifies the probability of correct chain-of-thought—not just correct answers.
RLVR · GRPO · chain-of-thought
Paper Review
Training a video generation model that matches commercial leaders like Runway Gen-3 Alpha—for $200K instead of tens of millions. Open-Sora 2.0 demonstrates that aggressive cost engineering across data, architecture, and compute can reduce the barrier to high-quality video AI by orders of magnitude.
Open-Sora · video generation · diffusion model
Critical Review
The intuition seems obvious: let the model think longer and it will reason better. But empirical findings challenge this assumption. Correct solutions tend to be shorter than incorrect ones on the same problem, and parallel sampling may outperform sequential deepening—suggesting that test-time compute scaling has limits the field has not fully reckoned with.
test-time compute · chain-of-thought · scaling
Critical Review
Models advertise million-token context windows, but can they actually use all that context? Tavakoli et al. (2025) benchmark long-term memory in LLMs and find non-uniform degradation—performance drops as conversations expand, with information in the middle of long contexts systematically neglected.
long context · context window · lost in the middle
Deep Dive
DeepSeek-V3 stores 671 billion parameters but activates only 37 billion per token—a ratio of roughly 18:1. This architectural choice, combining Multi-head Latent Attention with auxiliary-loss-free load balancing in a Mixture-of-Experts framework, achieves competitive performance at a reported training cost of 2.788 million H800 GPU-hours.
DeepSeek-V3 · MoE · mixture of experts
Paper Review
AI Scientist-v2 automates the full scientific workflow—hypothesis formation, experimentation, data analysis, and paper writing—using agentic tree search. The resulting papers, fully AI-generated, achieve an average reviewer score of 6.33 in human peer review, meeting the acceptance threshold for workshop venues. The question is no longer whether AI can write papers, but what this means for scientific practice.
AI scientist · automated discovery · peer review
Paper Review
Can AI predict AI's own scaling behavior? Hu et al. (2026) replace hand-designed scaling law formulas with a neural network that learns to predict downstream task performance, achieving 2.04% MAE across 66 tasks—a 38% error reduction over logistic baselines. The meta-level question: what does it mean when we need neural networks to understand neural networks?
scaling laws · meta-learning · prediction
Critical Review
When AI systems learn to game their reward signals—achieving high scores without achieving the intended goals—the result is 'reward hacking.' A new approach using causal reasoning rather than correlation-based rewards may offer a path toward more robust AI alignment.
AI alignment · reward hacking · causal rewards