Paper Review · Mathematics & Statistics · Machine/Deep Learning
Generating Graphs the Bayesian Way: Discrete Diffusion for Molecular and Network Design
Graphs are discrete, unordered structures, fundamentally different from the continuous data that standard diffusion models handle. Petersen et al. develop a Bayesian framework for discrete graph generation that combines diffusion and flow matching models with principled posterior inference.
By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
Generating graph-structured data (molecules with specific properties, social networks with desired characteristics, knowledge graphs with correct relational structure) is a central challenge in AI. Graphs are fundamentally different from images or text: they are discrete (nodes and edges are categorical, not continuous), unordered (there is no canonical ordering of nodes), and variable-sized (different graphs have different numbers of nodes and edges).
These properties make standard generative models (VAEs, GANs, continuous diffusion) poorly suited for graphs. Continuous diffusion adds Gaussian noise to pixel values; you cannot meaningfully add Gaussian noise to a graph's adjacency matrix (the result is no longer a valid graph). Autoregressive generation produces nodes in sequence, but the generation order is arbitrary, and different orderings produce the same graph, creating a many-to-one redundancy.
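The ordering redundancy is easy to see concretely. Below is a minimal NumPy sketch (illustrative, not from the paper): relabeling the nodes of a 4-node path graph with a permutation matrix changes the adjacency matrix entry by entry, yet the graph itself is unchanged.

```python
import numpy as np

# A 4-node path graph 0-1-2-3, written as an adjacency matrix.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Relabel the nodes with a permutation matrix P. The matrix entries
# change, but the underlying graph does not.
perm = [2, 0, 3, 1]                  # an arbitrary relabeling
P = np.eye(4, dtype=int)[perm]
A_perm = P @ A @ P.T

print(np.array_equal(A, A_perm))     # False: a different matrix ...
# ... yet both encode the same (isomorphic) path graph; for instance
# the sorted degree sequences agree:
print(sorted(A.sum(axis=0)) == sorted(A_perm.sum(axis=0)))  # True
```

Every one of the n! relabelings is a distinct adjacency matrix for the same graph, which is exactly the many-to-one redundancy that permutation-invariant architectures are designed to absorb.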
Petersen et al. develop a Bayesian framework for discrete graph generation that addresses these challenges by:
Working in the discrete domain natively: using discrete diffusion and flow matching that operate on categorical node and edge types
Performing posterior inference rather than just sampling: enabling conditional generation (graphs with specific properties) through Bayesian conditioning
Handling graph symmetry through permutation-invariant architectures
The Discrete Diffusion Framework
Continuous diffusion models corrupt data by adding Gaussian noise and then learn to reverse this corruption. Discrete diffusion models corrupt data by randomly replacing categorical values (node types, edge types) with random alternatives, then learn to reverse this corruption, recovering the original graph from a uniformly random graph.
The forward process is simple: at each step, each node/edge type has a probability of being randomly reassigned. After many steps, the graph becomes uniformly randomโall structural information is destroyed.
The reverse process is the learned generative model: given a noisy graph, predict the original graph. By iteratively applying the reverse process from pure noise, the model generates new graphs that match the distribution of training data.
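The forward and reverse processes described above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes a uniform-resampling forward kernel over node types, and `denoise_model` is a hypothetical stand-in for the trained network (here it just returns a uniform distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_NODE_TYPES = 4   # e.g. atom types; an illustrative choice

def forward_corrupt(node_types, t, T):
    """Forward process sketch: by step t, each categorical value has
    been resampled uniformly at random with probability t / T."""
    resample = rng.random(node_types.shape) < t / T
    noise = rng.integers(0, NUM_NODE_TYPES, size=node_types.shape)
    return np.where(resample, noise, node_types)

x0 = rng.integers(0, NUM_NODE_TYPES, size=10)  # a "clean" graph's node types
xT = forward_corrupt(x0, t=100, T=100)         # t = T: uniformly random

def denoise_model(xt, t):
    """Hypothetical stand-in for the trained network: returns a
    probability distribution over each node's original type."""
    return np.full((len(xt), NUM_NODE_TYPES), 1.0 / NUM_NODE_TYPES)

def reverse_sample(T=100, n=10):
    """Generate by iteratively denoising from pure categorical noise.
    (Real samplers mix the model's prediction with the forward
    kernel at step t; this loop only shows the overall shape.)"""
    xt = rng.integers(0, NUM_NODE_TYPES, size=n)
    for t in range(T, 0, -1):
        probs = denoise_model(xt, t)
        xt = np.array([rng.choice(NUM_NODE_TYPES, p=p) for p in probs])
    return xt
```

The same scheme applies to edge types; the only structural requirement is that the denoising network treat nodes as an unordered set.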
Petersen et al.'s Bayesian contribution is enabling conditional generation through posterior inference. Given a desired property (a molecule with specific binding affinity, a network with specific degree distribution), the posterior distribution over graphs conditioned on the property can be approximated by modifying the reverse diffusion process to favor graphs consistent with the condition.
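One common way to realize this kind of conditioning (a discrete analogue of classifier guidance; a sketch of the general idea, not necessarily the paper's exact update rule) is to reweight the model's per-node categorical distribution at each reverse step by a property predictor's log-likelihood, then renormalize:

```python
import numpy as np

def guided_step_probs(model_probs, prop_loglik):
    """Bayesian conditioning sketch: combine the unconditional model's
    categorical distribution with a property predictor's log-likelihood
    per candidate type, then renormalize (Bayes' rule per node)."""
    logits = np.log(model_probs) + prop_loglik
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Illustrative numbers: the unconditional model favors type 0, but the
# property predictor says the target property is far likelier under type 2.
model_probs = np.array([[0.7, 0.2, 0.05, 0.05]])
prop_loglik = np.array([[0.0, 0.0, 3.0, 0.0]])
posterior = guided_step_probs(model_probs, prop_loglik)
print(posterior.argmax())   # 2: conditioning shifts mass toward type 2
```

Applying this reweighting at every reverse step steers sampling toward graphs consistent with the condition without retraining the generative model.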
Applications
Molecular design: Generate molecules with target properties (solubility, binding affinity, toxicity) by conditioning the graph generation on property predictors.
Knowledge graphs: Generate plausible knowledge graph completionsโnew edges that are consistent with the existing graph structure.
Network synthesis: Generate synthetic networks with specific structural properties (clustering coefficient, degree distribution, community structure) for simulation and testing.
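Checking whether synthesized networks actually hit those structural targets is straightforward with networkx. A minimal sketch, using a Barabási-Albert graph as a stand-in for a "real" target network and a random graph as a stand-in for a generated one:

```python
import networkx as nx

# Stand-ins: in practice, `generated` would come from the trained model.
target = nx.barabasi_albert_graph(200, 2, seed=0)
generated = nx.gnm_random_graph(200, target.number_of_edges(), seed=1)

# Compare structural statistics of generated vs. target graphs.
for name, G in [("target", target), ("generated", generated)]:
    degs = [d for _, d in G.degree()]
    print(f"{name}: clustering={nx.average_clustering(G):.3f}, "
          f"max degree={max(degs)}")
```

Matching edge counts while differing in clustering and degree tails is typical of naive baselines, which is precisely what conditioning on structural properties is meant to fix.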
Claims and Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| Discrete diffusion handles graph structure natively | Framework operates on categorical node/edge types | ✅ Supported |
| Bayesian conditioning enables property-targeted generation | Posterior inference demonstrated for conditional generation | ✅ Supported |
| Generated graphs match training distribution quality | Evaluation on molecular and network benchmarks | ✅ Supported |
| The approach outperforms autoregressive graph generation | Competitive on benchmarks; advantages in symmetry handling | ⚠️ Competitive, not uniformly superior |
Open Questions
Scalability: Current demonstrations involve graphs with tens to hundreds of nodes. Can discrete diffusion scale to graphs with thousands of nodes (protein structures, large social networks)?
Validity constraints: Not all graphs are valid molecules (valence rules, ring strain). How do we incorporate domain-specific validity constraints into the generation process?
Multi-objective conditioning: Real molecular design involves multiple simultaneous objectives (potency AND selectivity AND solubility). How do we condition on multiple properties without generating Pareto-suboptimal compromises?
Evaluation metrics: How do we evaluate the quality of generated graphs? Distributional metrics (comparing generated vs. real graph distributions) are standard but may miss important structural properties.
What This Means for Your Research
For computational chemists, Bayesian graph generation provides a principled framework for molecular design that explicitly handles the discrete, unordered nature of molecular graphs, a more natural fit than continuous generative models that must be adapted.
For graph ML researchers, the discrete Bayesian framework provides a theoretically grounded alternative to the ad hoc adaptations of continuous generative models to discrete graph data that have dominated the field.
References (1)
[1] Petersen, O., Kollovieh, M., Lienen, M. (2025). Discrete Bayesian Sample Inference for Graph Generation. arXiv:2511.03015.