Posterior Refinement:
Fast Language Generation via Any-Order Flow Maps

¹Carnegie Mellon University, ²KAIST
*Equal contribution (order decided by a coin flip)  ·  Equal advising

TL;DR

We make iterative refinement actually work for non-autoregressive LMs.

  • The Problem: Discrete diffusion models promise iterative refinement, but sampling independent marginals causes factorization error and collapses performance. The generate, erase, repeat loop becomes infeasible when every single generation takes several function evaluations.
  • The Missing Piece: Flow Map Language Models (FMLMs) fix this factorization error via joint transport between Gaussian noise and one-hot data.
  • The Solution: We put it all together. Masking-style FMLM enables Posterior Refinement, letting the model self-correct based on a full draft sequence with a novel a posteriori confidence signal.
  • The Result: SOTA speed-quality tradeoff, matching discrete baselines with 32× fewer NFEs.

Iterative refinement on GSM8K. FMLM+ unlocks self-correction: with as few as 32 rounds of Posterior Refinement, it outperforms all diffusion baselines run at 1024 function evaluations. Leveraging a pretrained MDM lifts the curve further still.

Abstract

Non-autoregressive generation offers a powerful paradigm for iterative refinement, allowing models to recursively critique, erase and regenerate arbitrary subsets of tokens. However, existing non-autoregressive models fail to realize this potential. Masked Diffusion Models (MDMs) suffer from factorization error, causing sample quality to collapse when generating multiple tokens simultaneously. Flow Map Language Models (FMLMs) circumvent this bottleneck via joint sequence transport for excellent few-step generation, but sacrifice the inference-time flexibility of MDMs.

We introduce FMLM+, a framework that bridges this gap by equipping FMLM with masking-style noise schedules. While generating the full sequence in a single step, FMLM+ simultaneously scores the global consistency of each token a posteriori. We leverage this to introduce Posterior Refinement, a novel inference-time refinement strategy that enables the model to adaptively self-correct its outputs, matching the performance of discrete baselines with 32× fewer NFEs. Across diverse benchmarks, we demonstrate that FMLM+ with Posterior Refinement improves the speed–quality tradeoff over both MDM and FMLM families, providing a scalable foundation for high-fidelity language modeling.

How Posterior Refinement Works

Posterior Refinement with FMLM+. Posterior Refinement lets the model judge the fit of each token after the fact and fix its own mistakes in parallel. The model generates all tokens in parallel and scores each token's posterior confidence given the entire draft. It commits the high-confidence tokens, re-noises the rest, and repeats. Crucially, the incorrect tokens consistently fall within the low-confidence set, so refinement reliably filters and revises errors.
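
The generate, score, commit, re-noise loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generate` is a hypothetical stand-in for one FMLM+ call, the fixed `threshold` commit rule is an assumption, and re-noising is modeled simply by marking a position as uncommitted (`None`).

```python
import random

def toy_generate(committed, target, rng):
    """Hypothetical stand-in for one FMLM+ call (NOT the real model).

    Returns a full draft plus a per-position confidence in [0, 1].
    Uncommitted positions are more likely to be right when more of the
    sequence is already committed, mimicking conditioning on context.
    """
    frac_known = sum(tok is not None for tok in committed) / len(committed)
    draft, conf = [], []
    for tok, gold in zip(committed, target):
        if tok is not None:                        # committed tokens are kept
            draft.append(tok)
            conf.append(1.0)
        elif rng.random() < 0.5 + 0.5 * frac_known:
            draft.append(gold)                     # correct guess, high confidence
            conf.append(rng.uniform(0.9, 1.0))
        else:
            draft.append(gold + 1)                 # wrong guess, low confidence
            conf.append(rng.uniform(0.0, 0.5))
    return draft, conf

def posterior_refinement(target, rounds=32, threshold=0.8, seed=0):
    rng = random.Random(seed)
    committed = [None] * len(target)               # None = position is still noise
    draft = committed
    for _ in range(rounds):
        # 1. Draft every position in parallel and score it a posteriori.
        draft, conf = toy_generate(committed, target, rng)
        # 2. Commit confident tokens; everything else is re-noised.
        committed = [tok if c >= threshold else None
                     for tok, c in zip(draft, conf)]
        # 3. Stop early once every position is committed.
        if all(tok is not None for tok in committed):
            break
    return draft

target = [3, 1, 4, 1, 5, 9, 2, 6]
print(posterior_refinement(target))
```

In this toy, wrong tokens always score below the threshold, which mirrors the observation that incorrect tokens fall within the low-confidence set. A real model offers no such guarantee, so the threshold (or a top-k commit rule) would need tuning.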

A-Priori vs. A-Posteriori Confidence

MDMs only produce independent token-wise marginals. To decide which tokens to keep, they estimate confidence a priori — without knowing what tokens will be generated at other positions. In contrast, Posterior Refinement estimates confidence a posteriori — after the full sequence is generated. This allows the model to assess global consistency and revise mistakes based on the entire context.

Confidence-based sampling with MDMs was popularized by Sudoku solving. We show that when the target is multi-modal, as in unconditional generation of Sudoku boards, a-priori confidence fails catastrophically, while FMLM+'s a-posteriori confidence reaches 87% accuracy in just 9 function evaluations.

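Unconditional Sudoku (%, ↑) by NFE

| Method | 1 | 3 | 9 | 27 | 81 |
|---|---|---|---|---|---|
| MDLM (Ancestral) | 0 | 0 | 0 | 7.7 | 33.8 |
| MDLM (Confidence) | 0 | 0 | 0 | 0.0 | 97.2 |
| FMLM+ (a-posteriori) | 0 | 22.2 | 87.1 | 98.8 | 99.6 |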

Accuracy (%, ↑) versus number of function evaluations. A-priori confidence fails catastrophically when sampling multiple tokens at once.
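
A two-position toy case makes the failure mode concrete. Assume, purely for illustration, a target distribution with two equally likely valid sequences, `AA` and `BB` (this example is not from the paper). The per-position marginals are then uniform, so sampling positions independently yields an invalid sequence half the time, and the a-priori confidence is a flat 0.5 that cannot say which position to trust.

```python
from itertools import product

# Hypothetical 2-mode target: only "AA" and "BB" are valid, each with prob 0.5.
joint = {("A", "A"): 0.5, ("B", "B"): 0.5}
vocab = ["A", "B"]

# Per-position marginals: all that a factorized predictor can express
# when both positions are still masked.
marginals = [
    {v: sum(p for seq, p in joint.items() if seq[pos] == v) for v in vocab}
    for pos in range(2)
]

# Sampling each position independently draws from the product of marginals.
factorized = {
    seq: marginals[0][seq[0]] * marginals[1][seq[1]]
    for seq in product(vocab, repeat=2)
}

# Probability mass the factorized sampler puts on sequences outside the target.
invalid_mass = sum(p for seq, p in factorized.items() if seq not in joint)

# A-priori confidence: the maximum marginal probability at each position.
a_priori_conf = [max(m.values()) for m in marginals]

print(marginals)
print(invalid_mass)
print(a_priori_conf)
```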

Results

Posterior Refinement delivers large efficiency gains across all considered benchmarks. On Sudoku, FMLM+ reaches 97.9 / 92.0 / 71.2 (Easy/Med./Hard) with just 4 NFEs — surpassing every baseline evaluated at 128 NFEs. On GSM8K, it reaches 19.0% accuracy at 32 NFEs, a 32× speedup over the strongest non-autoregressive baselines at matched accuracy. On TinyStories and OpenWebText, FMLM+ matches or surpasses every diffusion baseline with up to 8× fewer evaluations.

Conditional generation

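| Method | Sudoku NFE | Sudoku Easy | Sudoku Med. | Sudoku Hard | GSM8K NFE | GSM8K Acc. (%) |
|---|---|---|---|---|---|---|
| Autoregressive | | | | | | |
| Sample | 128 | 13.9 | 5.1 | 0.6 | 512 | 53.9 |
| Discrete Diffusion | | | | | | |
| MDLM | 128 | 92.0 | 77.1 | 30.2 | 1024 | 18.0 |
| Duo | 128 | 96.3 | 84.7 | 58.4 | 1024 | 17.2 |
| Continuous Diffusion | | | | | | |
| CANDI | 128 | 79.3 | 45.9 | 16.7 | 1024 | 0.2 |
| S-FLM | 128 | 94.8 | 85.2 | 45.0 | 1024 | 18.0 |
| FMLM+ (PR) | 4 | 97.9 | 92.0 | 71.2 | 32 | 19.0 |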
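
The speedup figures quoted for the conditional benchmarks are ratios of NFE budgets. A quick check, using only the NFE values listed in the table above (the dictionary names are illustrative):

```python
# NFE budgets read from the conditional-generation table.
baseline_nfe = {"Sudoku": 128, "GSM8K": 1024}   # diffusion baselines
fmlm_plus_nfe = {"Sudoku": 4, "GSM8K": 32}      # FMLM+ (PR)

# Speedup = baseline evaluations per FMLM+ evaluation.
speedup = {task: baseline_nfe[task] // fmlm_plus_nfe[task]
           for task in baseline_nfe}
print(speedup)
```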

Unconditional generation

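| Method | TinyStories NFE | TinyStories Gen. PPL ↓ | TinyStories Entropy ↑ | OpenWebText NFE | OpenWebText Gen. PPL ↓ | OpenWebText Entropy ↑ |
|---|---|---|---|---|---|---|
| Autoregressive | | | | | | |
| Sample | 128 | 8.89 | 4.01 | 1024 | 35.45 | 5.58 |
| Discrete Diffusion | | | | | | |
| MDLM | 128 | 18.74 | 4.03 | 1024 | 105.15 | 5.63 |
| Duo | 128 | 22.73 | 4.05 | 1024 | 77.69 | 5.55 |
| Continuous Diffusion | | | | | | |
| CANDI | 128 | 46.32 | 4.04 | 1024 | 143.13 | 5.71 |
| FLM | 128 | 57.50 | 4.17 | 1024 | 62.23 | 5.33 |
| S-FLM | 128 | 95.25 | 4.08 | 1024 | 123.87 | 5.52 |
| FMLM+ (PR) | 32 | 17.53 | 3.96 | 128 | 66.6 | 5.21 |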

Path To Scaling

We show that the masked diffusion training objective is exactly the boundary case s = t = 0 of the FMLM+ training objective. This correspondence makes the growing pool of pretrained MDMs directly usable to accelerate FMLM+ training, via either distillation (using the MDM as a teacher) or direct warm-starting from its weights.

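GSM8K (%, ↑) by NFE

| Method | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| MDLM (Teacher) | 0.0 | 0.0 | 0.1 | 0.6 | 4.7 | 9.0 | 13.4 |
| FMLM+ | 0.0 | 0.1 | 2.9 | 8.7 | 13.4 | 19.0 | 19.1 |
| FMLM+ (Distill) | 0.3 | 0.4 | 3.9 | 10.3 | 18.7 | 21.6 | 23.4 |
| FMLM+ (Init) | 0.3 | 0.7 | 5.1 | 15.1 | 26.1 | 31.8 | 33.6 |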

GSM8K accuracy (%, ↑) under Posterior Refinement, comparing three FMLM+ training strategies against the MDM teacher.

Improved training for FMLM+. Left: GSM8K accuracy under 32 rounds of Posterior Refinement — both teacher-based variants beat training from scratch. Right: training loss — warm-starting converges faster and to a better optimum.

Understanding A-Posteriori Confidence

Starting from pure Gaussian noise, FMLM+ steps along the flow trajectory to simultaneously denoise all positions. At the final integration step, because the model operates on a nearly complete sequence, it implicitly evaluates the conditional probability of each token given the rest of the generated text. This enables us to compute token confidence conditioned on the fully generated sequence, effectively evaluating each token's fit within the global context.

Similar to how MDMs use the maximum token probability as a confidence measure, we take the a posteriori confidence at position $l$ to be $$p_{\max}^l := \max_v \hat{x}^{l,v},$$ where $\hat{x}$ is the empirical generated sequence and $v$ ranges over the vocabulary.

An ideal FMLM would always produce one-hot vectors. In practice, however, the model outputs categorical distributions that are not strictly one-hot. We interpret the rounding error incurred by projecting these outputs onto one-hot vectors as a proxy for the model's confidence: the larger the error, the lower the confidence. This interpretation aligns with our empirical observations in Sudoku, where incorrect tokens consistently exhibit high rounding error.
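
As a small sketch of how this confidence could be computed, the snippet below takes the per-position maximum of a model output and derives a rounding error from it. The source does not specify the exact error metric, so `1 - p_max` is an assumed, illustrative choice, and the `x_hat` values are made up.

```python
def posterior_confidence(x_hat):
    """p_max for every position: the largest entry in that position's row."""
    return [max(row) for row in x_hat]

def rounding_error(x_hat):
    """Illustrative choice (an assumption): 1 - p_max, which is zero
    exactly when the row is already one-hot."""
    return [1.0 - p for p in posterior_confidence(x_hat)]

# Hypothetical model output: 3 positions over a 4-token vocabulary.
x_hat = [
    [1.0, 0.0, 0.0, 0.0],      # exactly one-hot
    [0.25, 0.25, 0.5, 0.0],    # leaning toward token 2
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
]

conf = posterior_confidence(x_hat)
err = rounding_error(x_hat)

# Positions ordered from least to most confident: the first entries are
# the natural candidates for re-noising.
revise_order = sorted(range(len(conf)), key=lambda pos: conf[pos])
print(conf, err, revise_order)
```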

Key idea. MDMs, and equivalently $\delta_{0,0}$, conflate aleatoric uncertainty (the inherent randomness of the data distribution) with epistemic uncertainty (the model’s internal confidence). We hypothesize that, because the FMLM fails to perfectly learn the joint transport from Gaussian noise to clean one-hot data, its trajectory surfaces the model’s true epistemic uncertainty through errors in its predictions.

Two types of confidence. On a 2-mode toy problem, the one-step FMLM $\delta_{0,1}$ outputs high confidence for most inputs. Low-confidence regions concentrate near decision boundaries, where the endpoint is ambiguous, indicating that $\delta_{0,1}$ captures the epistemic confidence of the model. In contrast, $\delta_{0,0}$ is flat for all inputs, reflecting its aleatoric nature.
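
The contrast can be mimicked in one dimension. The sketch below assumes two equally likely modes and scalar Gaussian noise `z`. `delta_00` returns the data marginal, since at s = t = 0 the input is pure noise that carries no information about the data. `delta_01` is a hypothetical smooth stand-in for an imperfectly learned one-step map, not the trained model from the paper, and the `sharpness` parameter is likewise made up.

```python
import math

def delta_00(z):
    """Prediction at s = t = 0: the input is pure noise, independent of the
    data, so the best guess is the data marginal for every z."""
    return [0.5, 0.5]

def delta_01(z, sharpness=4.0):
    """Hypothetical imperfect one-step map: a smooth approximation of the
    hard rule 'z < 0 -> mode A, z > 0 -> mode B'."""
    p_b = 1.0 / (1.0 + math.exp(-sharpness * z))
    return [1.0 - p_b, p_b]

# Confidence = maximum entry of the predicted distribution.
for z in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(z, max(delta_00(z)), round(max(delta_01(z)), 3))
```

Under these assumptions the `delta_00` confidence stays at 0.5 everywhere, while the `delta_01` confidence dips only near the decision boundary at `z = 0`.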

BibTeX

@article{agarwal2026posteriorrefinement,
    title={Posterior Refinement: Fast Language Generation via Any-Order Flow Maps},
    author={Manan Agarwal and Sheel Shah and Chanhyuk Lee
            and Jaehoon Yoo and Jerry Huang and Seunghoon Hong
            and Aditi Raghunathan and Jinwoo Kim and Nicholas M. Boffi},
    journal={arXiv preprint arXiv:2606.24773},
    year={2026},
}

Contact

If you have any questions about the paper, code, or potential collaborations, please feel free to reach out to us at {mananaga, sheels}@cs.cmu.edu.

Acknowledgments

We would like to thank Modal Labs for their generous compute grants, which proved invaluable in supporting this work.