Paper Review · Mathematics & Statistics · Machine/Deep Learning
Generating Graphs the Bayesian Way: Discrete Diffusion for Molecular and Network Design
Graphs are discrete, unordered structures, fundamentally different from the continuous data that standard diffusion models handle. Petersen et al. develop a Bayesian framework for discrete graph generation that combines diffusion and flow matching models with principled posterior inference.
By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
Generating graph-structured data (molecules with specific properties, social networks with desired characteristics, knowledge graphs with correct relational structure) is a central challenge in AI. Graphs are fundamentally different from images or text: they are discrete (nodes and edges are categorical, not continuous), unordered (there is no canonical ordering of nodes), and variable-sized (different graphs have different numbers of nodes and edges).
These properties make standard generative models (VAEs, GANs, continuous diffusion) poorly suited for graphs. Continuous diffusion adds Gaussian noise to pixel values; you cannot meaningfully add Gaussian noise to a graph's adjacency matrix (the result is no longer a valid graph). Autoregressive generation produces nodes in sequence, but the generation order is arbitrary, and different orderings produce the same graph, creating a many-to-one redundancy.
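The ordering redundancy is easy to see concretely. Below is a minimal NumPy sketch (illustrative, not from the paper): relabeling the nodes of a 4-node path graph with a permutation matrix changes the adjacency matrix entry by entry, yet the graph itself is unchanged.

```python
import numpy as np

# A 4-node path graph 0-1-2-3, written as an adjacency matrix.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Relabel the nodes with a permutation matrix P. The matrix entries
# change, but the underlying graph does not.
perm = [2, 0, 3, 1]                  # an arbitrary relabeling
P = np.eye(4, dtype=int)[perm]
A_perm = P @ A @ P.T

print(np.array_equal(A, A_perm))     # False: a different matrix ...
# ... yet both encode the same (isomorphic) path graph; for instance
# the sorted degree sequences agree:
print(sorted(A.sum(axis=0)) == sorted(A_perm.sum(axis=0)))  # True
```

Every one of the n! relabelings is a distinct adjacency matrix for the same graph, which is exactly the many-to-one redundancy that permutation-invariant architectures are designed to absorb.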
Petersen et al. develop a Bayesian framework for discrete graph generation that addresses these challenges by:
Working in the discrete domain natively: using discrete diffusion and flow matching that operate on categorical node and edge types
Performing posterior inference rather than just sampling: enabling conditional generation (graphs with specific properties) through Bayesian conditioning
Handling graph symmetry through permutation-invariant architectures
The Discrete Diffusion Framework
Continuous diffusion models corrupt data by adding Gaussian noise and then learn to reverse this corruption. Discrete diffusion models corrupt data by randomly replacing categorical values (node types, edge types) with random alternatives, then learn to reverse this corruption, recovering the original graph from a uniformly random graph.
The forward process is simple: at each step, each node/edge type has a probability of being randomly reassigned. After many steps, the graph becomes uniformly randomโall structural information is destroyed.
The reverse process is the learned generative model: given a noisy graph, predict the original graph. By iteratively applying the reverse process from pure noise, the model generates new graphs that match the distribution of training data.
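The forward and reverse processes described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes a uniform-resampling forward kernel over node types, and `denoise_model` is a hypothetical stand-in for the trained network (here it just returns a uniform distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_NODE_TYPES = 4   # e.g. atom types; an illustrative choice

def forward_corrupt(node_types, t, T):
    """Forward process sketch: by step t, each categorical value has
    been resampled uniformly at random with probability t / T."""
    resample = rng.random(node_types.shape) < t / T
    noise = rng.integers(0, NUM_NODE_TYPES, size=node_types.shape)
    return np.where(resample, noise, node_types)

x0 = rng.integers(0, NUM_NODE_TYPES, size=10)  # a "clean" graph's node types
xT = forward_corrupt(x0, t=100, T=100)         # t = T: uniformly random

def denoise_model(xt, t):
    """Hypothetical stand-in for the trained network: returns a
    probability distribution over each node's original type."""
    return np.full((len(xt), NUM_NODE_TYPES), 1.0 / NUM_NODE_TYPES)

def reverse_sample(T=100, n=10):
    """Generate by iteratively denoising from pure categorical noise.
    (Real samplers mix the model's prediction with the forward
    kernel at step t; this loop only shows the overall shape.)"""
    xt = rng.integers(0, NUM_NODE_TYPES, size=n)
    for t in range(T, 0, -1):
        probs = denoise_model(xt, t)
        xt = np.array([rng.choice(NUM_NODE_TYPES, p=p) for p in probs])
    return xt
```

The same scheme applies to edge types; the only structural requirement is that the denoising network treat nodes as an unordered set.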
Petersen et al.'s Bayesian contribution is enabling conditional generation through posterior inference. Given a desired property (a molecule with specific binding affinity, a network with specific degree distribution), the posterior distribution over graphs conditioned on the property can be approximated by modifying the reverse diffusion process to favor graphs consistent with the condition.
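One common way to realize this kind of conditioning (a discrete analogue of classifier guidance; a sketch of the general idea, not necessarily the paper's exact update rule) is to reweight the model's per-node categorical distribution at each reverse step by a property predictor's log-likelihood, then renormalize:

```python
import numpy as np

def guided_step_probs(model_probs, prop_loglik):
    """Bayesian conditioning sketch: combine the unconditional model's
    categorical distribution with a property predictor's log-likelihood
    per candidate type, then renormalize (Bayes' rule per node)."""
    logits = np.log(model_probs) + prop_loglik
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Illustrative numbers: the unconditional model favors type 0, but the
# property predictor says the target property is far likelier under type 2.
model_probs = np.array([[0.7, 0.2, 0.05, 0.05]])
prop_loglik = np.array([[0.0, 0.0, 3.0, 0.0]])
posterior = guided_step_probs(model_probs, prop_loglik)
print(posterior.argmax())   # 2: conditioning shifts mass toward type 2
```

Applying this reweighting at every reverse step steers sampling toward graphs consistent with the condition without retraining the generative model.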
Applications
Molecular design: Generate molecules with target properties (solubility, binding affinity, toxicity) by conditioning the graph generation on property predictors.
Knowledge graphs: Generate plausible knowledge graph completionsโnew edges that are consistent with the existing graph structure.
Network synthesis: Generate synthetic networks with specific structural properties (clustering coefficient, degree distribution, community structure) for simulation and testing.
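Checking whether synthesized networks actually hit those structural targets is straightforward with networkx. A minimal sketch, using a Barabási-Albert graph as a stand-in for a "real" target network and a random graph as a stand-in for a generated one:

```python
import networkx as nx

# Stand-ins: in practice, `generated` would come from the trained model.
target = nx.barabasi_albert_graph(200, 2, seed=0)
generated = nx.gnm_random_graph(200, target.number_of_edges(), seed=1)

# Compare structural statistics of generated vs. target graphs.
for name, G in [("target", target), ("generated", generated)]:
    degs = [d for _, d in G.degree()]
    print(f"{name}: clustering={nx.average_clustering(G):.3f}, "
          f"max degree={max(degs)}")
```

Matching edge counts while differing in clustering and degree tails is typical of naive baselines, which is precisely what conditioning on structural properties is meant to fix.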
Claims and Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| Discrete diffusion handles graph structure natively | Framework operates on categorical node/edge types | ✅ Supported |
| Bayesian conditioning enables property-targeted generation | Posterior inference demonstrated for conditional generation | ✅ Supported |
| Generated graphs match training distribution quality | Evaluation on molecular and network benchmarks | ✅ Supported |
| The approach outperforms autoregressive graph generation | Competitive on benchmarks; advantages in symmetry handling | ⚠️ Competitive, not uniformly superior |
Open Questions
Scalability: Current demonstrations involve graphs with tens to hundreds of nodes. Can discrete diffusion scale to graphs with thousands of nodes (protein structures, large social networks)?
Validity constraints: Not all graphs are valid molecules (valence rules, ring strain). How do we incorporate domain-specific validity constraints into the generation process?
Multi-objective conditioning: Real molecular design involves multiple simultaneous objectives (potency AND selectivity AND solubility). How do we condition on multiple properties without generating Pareto-suboptimal compromises?
Evaluation metrics: How do we evaluate the quality of generated graphs? Distributional metrics (comparing generated vs. real graph distributions) are standard but may miss important structural properties.
What This Means for Your Research
For computational chemists, Bayesian graph generation provides a principled framework for molecular design that explicitly handles the discrete, unordered nature of molecular graphs, a more natural fit than continuous generative models that must be adapted.
For graph ML researchers, the discrete Bayesian framework provides a theoretically grounded alternative to the ad hoc adaptations of continuous generative models to discrete graph data that have dominated the field.
References (1)
[1] Petersen, O., Kollovieh, M., Lienen, M. (2025). Discrete Bayesian Sample Inference for Graph Generation. arXiv:2511.03015.