WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Yifu Chen1*, Shengpeng Ji1*, Qian Chen2*, Tianle Liang1, Yangzhuo Li1, Ziqing Wang3, Wen Wang2, Jingyu Lu1, Haoxiao Wang1, Xueyi Pu1, Fan Zhuo1, Zhou Zhao1†
1Zhejiang University, 2Alibaba Group, 3Beijing University of Technology
*Equal contribution, †Corresponding author
Motivation and failure mode of unified RL for end-to-end spoken dialogue models. Applying a single preference objective to mixed text–speech outputs leads to cross-modal trade-offs, gradient imbalance, and reward dilution.

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial.

We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients.

We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.

Key Contributions

Failure Mode Analysis

Identify and characterize key failure modes of unified sequence-level preference optimization for mixed text–speech outputs, including weak cross-modal coupling, gradient-energy imbalance, and noisy acoustic rewards.

Adaptive Hybrid Training

A single-stage adaptive hybrid post-training scheme that applies preference optimization to text tokens while anchoring speech tokens with SFT, coupled with a rollout-reliability gating mechanism.

Consistent Gains

Experiments across architectures (VITA-Audio, KimiAudio) and benchmarks show consistent gains in both semantic quality (IQ) and acoustic expressiveness (EQ).

Method

Overview of the proposed single-stage adaptive hybrid post-training. SFT anchors the speech distribution while preference optimization (GRPO) refines semantic behavior on text tokens. A lightweight gating mechanism dynamically adjusts the SFT/RL mixture based on rollout quality.

Our dynamic hybrid post-training objective combines SFT and GRPO in a single loop:

$$\mathcal{L}_{\text{hybrid}}(\theta) = (1 - \lambda_t)\,\mathcal{L}_{\text{SFT}}(\theta) + \lambda_t\,\mathcal{L}^{(T)}_{\text{GRPO}}(\theta)$$

where $\mathcal{L}^{(T)}_{\text{GRPO}}$ restricts preference gradients to text tokens only, and $\lambda_t$ is dynamically gated by two signals:

  • Direction gate $g_t^{(R)}$: activates when at least one rollout exceeds acceptable quality ($R_{\max} > 3$)
  • Information gate $g_t^{(V)}$: activates when rollout rewards are well-separated (high normalized variance)

EMA smoothing ($\alpha = 0.9$) stabilizes the weight across steps, and a minimum SFT fraction ($1 - \lambda_{\max} = 0.2$) is always retained as a safety anchor against acoustic drift.
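
The gating rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the product rule for combining the two gates, the cut-off `var_threshold`, the 1-to-5 reward scale used to normalize the variance, the initial weight of zero, and the exact form of the EMA update are all assumptions. Only the `R_max > 3` condition, the smoothing factor 0.9, and the cap of 0.8 on the RL weight come from the text.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Sequence


@dataclass
class MixtureGate:
    """Tracks the RL weight lambda_t from per-step rollout rewards."""
    r_threshold: float = 3.0     # from the text: direction gate needs R_max > 3
    alpha: float = 0.9           # from the text: EMA smoothing factor
    lam_max: float = 0.8         # from the text: 1 - lam_max = 0.2 SFT floor
    var_threshold: float = 0.05  # ASSUMPTION: cut-off for "well-separated"
    reward_range: float = 4.0    # ASSUMPTION: rewards lie on a 1-5 scale
    lam: float = 0.0             # ASSUMPTION: training starts from pure SFT

    def update(self, rewards: Sequence[float]) -> float:
        # Direction gate: at least one rollout is of acceptable quality.
        direction = 1.0 if max(rewards) > self.r_threshold else 0.0
        # Information gate: variance divided by its maximum for a bounded
        # range, (range / 2) ** 2, so the normalized value lies in [0, 1].
        norm_var = pvariance(rewards) / (self.reward_range / 2) ** 2
        information = 1.0 if norm_var >= self.var_threshold else 0.0
        # ASSUMPTION: both gates must be open for RL to receive any weight.
        target = self.lam_max * direction * information
        # EMA keeps lambda_t from jumping between steps; since the target
        # never exceeds lam_max, neither does the smoothed weight.
        self.lam = self.alpha * self.lam + (1.0 - self.alpha) * target
        return self.lam


def hybrid_loss(loss_sft: float, loss_grpo_text: float, lam: float) -> float:
    """L_hybrid = (1 - lambda_t) * L_SFT + lambda_t * L_GRPO restricted to text."""
    return (1.0 - lam) * loss_sft + lam * loss_grpo_text
```

With rewards `[4, 2, 5, 1]` both gates open and the weight moves a tenth of the way toward 0.8. A following batch of uniformly poor rollouts such as `[2, 2, 2, 2]` closes the direction gate and pulls the weight back toward 0, shifting the mixture toward SFT.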

Empirical Observations

Token-level probability change

Observation 1: SFT yields larger, coherent distribution shifts across the full sequence, while preference-optimization (PO) and RL updates, stabilized by trust-region constraints, are smaller and more localized.

Judge consistency

Observation 2: Reward-model judgments agree with humans more strongly on semantic quality than on acoustic quality, where agreement is weaker and more variable.
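
One common way to quantify agreement between a judge and human raters is a rank correlation computed separately for each dimension. The sketch below shows such a check with Spearman's coefficient. It is an assumption about how the comparison could be run: the source does not state which agreement metric was used, the helper names are illustrative, and the scores are made-up toy values rather than results from the study.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes both inputs contain at least two distinct values.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# TOY DATA (hypothetical): human scores and judge scores for five responses.
toy_human = [1, 2, 3, 4, 5]
toy_judge_semantic = [1, 2, 3, 5, 4]   # nearly the same ordering
toy_judge_acoustic = [3, 1, 5, 2, 4]   # ordering only loosely related

rho_semantic = spearman(toy_human, toy_judge_semantic)
rho_acoustic = spearman(toy_human, toy_judge_acoustic)
```

On these toy inputs the semantic judge tracks the human ordering much more closely than the acoustic judge, which is the qualitative pattern Observation 2 describes.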

Gradient geometry

Observation 3: Preference gradients concentrate on semantics; full-token PO yields low-SNR, high-variance updates on dense acoustics with near-zero cross-modal coupling.
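
To make the dilution argument concrete, the following sketch computes group-relative advantages and the per-token weights they induce with and without a text-only mask. It is a simplified illustration: the mean/std normalization follows the standard GRPO advantage, but clipping and KL terms are omitted, and the token counts, rewards, and helper names `group_advantages` and `token_weights` are hypothetical. It shows only how update weight is allocated across tokens, not the gradient SNR or cross-modal coupling measured in the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardized within the rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_weights(advantage: float, is_text: Sequence[bool],
                  text_only: bool) -> List[float]:
    """Per-token weight on the log-probability gradient for one rollout.

    The sequence-level advantage is averaged over the tokens being
    optimized: every token (full-token) or text tokens only (text-token).
    """
    mask = [t if text_only else True for t in is_text]
    n = max(sum(mask), 1)  # guard against an empty mask
    return [advantage / n if m else 0.0 for m in mask]


# HYPOTHETICAL rollout: 10 text tokens followed by 90 speech tokens.
is_text = [True] * 10 + [False] * 90
adv = group_advantages([4.0, 2.0, 3.0, 3.0])[0]

full = token_weights(adv, is_text, text_only=False)
text = token_weights(adv, is_text, text_only=True)

# Share of the total update weight that lands on text tokens.
full_text_share = sum(w for w, t in zip(full, is_text) if t) / sum(full)
text_text_share = sum(w for w, t in zip(text, is_text) if t) / sum(text)
```

In this toy setting full-token optimization places only a tenth of the update weight on text tokens, while the text-only mask places all of it there and leaves speech tokens to the SFT term.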

Rollout diversity

Observation 4: Rollout discriminability is uneven and stage-dependent: diversity and variance are often weakest along acoustics, which favors adaptive gating over fixed mixing.

Experimental Results

Intelligence (IQ) — VoiceBench & OpenAudioBench

Our dynamic hybrid method achieves the strongest overall IQ among all compared methods across both VITA (interleaved) and KimiAudio (parallel) architectures. Restricting preference updates to text tokens yields more reliable IQ gains than full-token optimization.

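Column prefixes: VB = VoiceBench, OAB = OpenAudioBench.

VITA Architecture (Interleaved)

| Method | VB Alpaca | VB Common | VB Wild | VB SD-QA | VB MMSU | VB OBQA | VB BBH | VB IFEval | VB Adv | OAB Alpaca | OAB Llama | OAB Reason | OAB Trivial | OAB Web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VITA-Base | 3.83 | 3.44 | 3.09 | 29.2 | 48.7 | 74.3 | 58.2 | 26.2 | 94.1 | 60.6 | 73.8 | 44.2 | 42.9 | 53.5 |
| SFT | 3.45 | 3.12 | 2.85 | 27.6 | 45.1 | 71.5 | 54.9 | 28.4 | 99.2 | 55.6 | 71.1 | 38.4 | 39.9 | 48.3 |
| Full-Token DPO | 3.60 | 3.29 | 2.89 | 30.2 | 44.7 | 69.2 | 56.8 | 22.6 | 65.0 | 20.1 | 55.4 | 33.4 | 29.8 | 36.6 |
| Text-Token DPO | 3.91 | 3.32 | 3.13 | 31.1 | 45.6 | 69.7 | 60.3 | 32.8 | 71.3 | 57.2 | 74.3 | 43.1 | 43.1 | 54.3 |
| Full-Token RL | 4.03 | 3.45 | 3.19 | 29.9 | 49.0 | 74.1 | 55.6 | 29.4 | 96.3 | 63.8 | 73.3 | 43.7 | 43.3 | 52.8 |
| Text-Token RL | 4.09 | 3.44 | 3.20 | 31.3 | 50.0 | 75.4 | 56.7 | 30.2 | 96.3 | 64.6 | 74.6 | 44.4 | 44.4 | 53.2 |
| SFT + RL (Two-Stage) | 3.49 | 3.32 | 2.69 | 22.5 | 44.7 | 70.8 | 54.2 | 25.5 | 98.8 | 54.0 | 66.2 | 32.8 | 35.1 | 49.0 |
| Ours (Dynamic) | 4.22 | 3.51 | 3.29 | 31.5 | 51.4 | 77.1 | 59.9 | 32.5 | 97.1 | 68.4 | 74.6 | 46.1 | 44.4 | 54.7 |

KimiAudio Architecture (Parallel)

| Method | VB Alpaca | VB Common | VB Wild | VB SD-QA | VB MMSU | VB OBQA | VB BBH | VB IFEval | VB Adv | OAB Alpaca | OAB Llama | OAB Reason | OAB Trivial | OAB Web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KimiAudio-Base | 4.46 | 3.97 | 3.42 | 63.1 | 62.2 | 83.5 | 64.2 | 61.1 | 100.0 | 75.7 | 79.3 | 58.0 | 62.1 | 70.2 |
| SFT | 4.15 | 3.65 | 3.10 | 59.8 | 58.4 | 79.5 | 61.2 | 64.5 | 100.0 | 71.4 | 75.2 | 52.8 | 58.4 | 66.9 |
| Full-Token DPO | 4.05 | 3.60 | 3.05 | 58.2 | 55.1 | 76.8 | 59.4 | 58.4 | 88.5 | 68.2 | 70.4 | 50.1 | 55.3 | 65.1 |
| Full-Token RL | 4.52 | 4.05 | 3.50 | 65.2 | 63.8 | 84.6 | 64.8 | 62.8 | 100.0 | 75.8 | 78.5 | 58.8 | 61.2 | 71.5 |
| Ours (Dynamic) | 4.58 | 4.22 | 3.68 | 67.9 | 66.5 | 87.1 | 68.3 | 66.8 | 99.5 | 78.5 | 81.2 | 61.5 | 61.8 | 71.1 |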

Expressiveness (EQ) — VStyle

On VStyle, SFT is competitive for style instruction following, but naive full-token DPO exhibits severe degradation. Our method achieves the best aggregate EQ across dimensions on both architectures, yielding a better IQ–EQ Pareto trade-off.

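VITA Architecture

| Method | Acoustic | Instruction | Role Play | Empathy |
|---|---|---|---|---|
| VITA-Base | 2.26 | 1.76 | 2.15 | 4.01 |
| SFT | 2.34 | 2.29 | 2.31 | 3.42 |
| Full-Token DPO | 1.49 | 1.25 | 1.10 | 1.05 |
| Text-Token DPO | 2.03 | 1.64 | 2.19 | 4.38 |
| Full-Token RL | 2.16 | 1.64 | 1.97 | 3.95 |
| Text-Token RL | 2.21 | 1.93 | 2.08 | 4.02 |
| Ours (Dynamic) | 2.55 | 2.25 | 2.41 | 4.44 |

KimiAudio Architecture

| Method | Acoustic | Instruction | Role Play | Empathy |
|---|---|---|---|---|
| KimiAudio-Base | 2.53 | 2.31 | 1.73 | 3.67 |
| SFT | 2.65 | 2.58 | 1.95 | 3.65 |
| Full-Token DPO | 1.85 | 1.55 | 1.30 | 2.10 |
| Full-Token RL | 2.58 | 2.25 | 1.88 | 3.88 |
| Ours (Dynamic) | 2.78 | 2.52 | 2.15 | 4.15 |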

Ablation: Weighting Schemes & Optimization Scope

| Scope | Strategy | IQ | EQ |
|---|---|---|---|
| All Tokens | Fixed (0.5/0.5) | 48.70 | 2.48 |
| Text Tokens | Fixed (0.5/0.5) | 52.60 | 2.60 |
| Text Tokens | Fixed (0.7 SFT / 0.3 RL) | 49.94 | 2.72 |
| All Tokens | Dynamic Weights | 48.84 | 2.50 |
| Text Tokens | Dynamic w/o EMA | 53.15 | 2.53 |
| Text Tokens | Ours (Dynamic) | 55.24 | 2.92 |

Human Subjective Evaluation

Side-by-side human study comparing our model with the original model on VITA-Audio. Our model is significantly preferred in helpfulness, naturalness, and overall quality, with wins outnumbering losses by more than 3:1 in every dimension (67.5% vs. 17.5% overall).

| Dimension | Win (%) | Tie (%) | Loss (%) | p-value |
|---|---|---|---|---|
| Helpfulness | 62.5 | 17.5 | 20.0 | 0.0046 |
| Naturalness | 65.0 | 15.0 | 20.0 | 0.0029 |
| Overall | 67.5 | 15.0 | 17.5 | < 0.001 |
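
The reported p-values are consistent with a two-sided exact sign test on the non-tied comparisons. The sketch below checks this under two assumptions that the text does not state: that the study used 40 paired comparisons (every percentage in the table is a multiple of 2.5%), and that ties were discarded before testing. The function name `sign_test_p` is illustrative.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test: ties dropped, H0 is P(win) = 0.5."""
    n = wins + losses
    k = max(wins, losses)
    # Upper-tail probability of seeing k or more successes out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# ASSUMPTION: 40 comparisons, so 62.5% / 17.5% / 20.0% becomes 25 / 7 / 8,
# 65.0% / 15.0% / 20.0% becomes 26 / 6 / 8, and 67.5% / 15.0% / 17.5%
# becomes 27 / 6 / 7. Only (wins, losses) enter the test.
counts = {
    "Helpfulness": (25, 8),
    "Naturalness": (26, 8),
    "Overall": (27, 7),
}
p_values = {name: sign_test_p(w, l) for name, (w, l) in counts.items()}
win_loss_ratio = {name: w / l for name, (w, l) in counts.items()}
```

Under these assumptions the computed p-values round to 0.0046 and 0.0029 for helpfulness and naturalness and fall below 0.001 overall, matching the table. The win-to-loss ratios come out at roughly 3.1, 3.3, and 3.9.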

BibTeX

@inproceedings{chen2026wavalign,
  title={WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training},
  author={Chen, Yifu and Ji, Shengpeng and Chen, Qian and Liang, Tianle and Li, Yangzhuo and Wang, Ziqing and Wang, Wen and Lu, Jingyu and Wang, Haoxiao and Pu, Xueyi and Zhuo, Fan and Zhao, Zhou},
  booktitle={Proceedings of ACL},
  year={2026}
}