WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Yifu Chen¹*, Shengpeng Ji¹*, Qian Chen²*, Tianle Liang¹, Yangzhuo Li¹, Ziqing Wang³, Wen Wang², Jingyu Lu¹, Haoxiao Wang¹, Xueyi Pu¹, Fan Zhuo¹, Zhou Zhao¹†

¹Zhejiang University, ²Alibaba Group, ³Beijing University of Technology

*Equal contribution, †Corresponding author

Figure: Motivation and failure mode of unified RL for end-to-end spoken dialogue models. Applying a single preference objective to mixed text–speech outputs leads to cross-modal trade-offs, gradient imbalance, and reward dilution.

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial.

We analyze the obstacles to this transfer from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on this analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue. It constrains preference updates to the semantic (text) channel, improves acoustic behavior via explicit SFT anchoring, and dynamically regulates the mixture of the two from rollout statistics to avoid unreliable preference gradients.

We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.

Key Contributions

Failure Mode Analysis

Identify and characterize key failure modes of unified sequence-level preference optimization for mixed text–speech outputs, including weak cross-modal coupling, gradient-energy imbalance, and noisy acoustic rewards.

Adaptive Hybrid Training

A single-stage adaptive hybrid post-training scheme that applies preference optimization to text tokens while anchoring speech tokens with SFT, coupled with a rollout-reliability gating mechanism.

Consistent Gains

Experiments across architectures (VITA-Audio, KimiAudio) and benchmarks show consistent gains in both semantic quality (IQ) and acoustic expressiveness (EQ).

Method

Figure: Overview of the proposed single-stage adaptive hybrid post-training. SFT anchors the speech distribution while preference optimization (GRPO) refines semantic behavior on text tokens. A lightweight gating mechanism dynamically adjusts the SFT/RL mixture based on rollout quality.

Our dynamic hybrid post-training objective combines SFT and GRPO in a single loop:

L_hybrid(θ) = (1 − λ_t) · L_SFT(θ) + λ_t · L^(T)_GRPO(θ)

where L^(T)_GRPO restricts preference gradients to text tokens only, and λ_t is dynamically gated by two signals:

Direction gate g_t(R): activates when at least one rollout exceeds acceptable quality (R_max > 3)
Information gate g_t(V): activates when rollout rewards are well-separated (high normalized variance)

Exponential moving average (EMA) smoothing with α = 0.9 stabilizes λ_t across steps, and a minimum SFT fraction of 1 − λ_max = 0.2 is always retained as a safety anchor against acoustic drift.
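
The gating described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The threshold R_max > 3, α = 0.9, and λ_max = 0.8 come from the text. Everything else is hypothetical: hard 0/1 gates combined by multiplication, a 1-to-5 reward scale, variance normalized by its maximum possible value on that scale, the `VAR_THRESHOLD` value, and the EMA convention in which α weights the previous value.

```python
from statistics import pvariance

R_MAX_THRESHOLD = 3.0      # from the text: direction gate fires when R_max > 3
EMA_ALPHA = 0.9            # from the text
LAMBDA_MAX = 0.8           # from the text: 1 - lambda_max = 0.2 is kept for SFT
REWARD_RANGE = (1.0, 5.0)  # assumption: rewards are scored on a 1-to-5 scale
VAR_THRESHOLD = 0.05       # assumption: hypothetical cut-off for "well-separated"


def direction_gate(rewards):
    """1.0 if at least one rollout exceeds the acceptable-quality threshold."""
    return 1.0 if max(rewards) > R_MAX_THRESHOLD else 0.0


def information_gate(rewards):
    """1.0 if the rollout rewards are well separated.

    Assumption: variance is normalized by the largest variance possible
    on REWARD_RANGE, which is ((hi - lo) / 2) ** 2.
    """
    lo, hi = REWARD_RANGE
    normalized_var = pvariance(rewards) / ((hi - lo) / 2) ** 2
    return 1.0 if normalized_var > VAR_THRESHOLD else 0.0


def update_lambda(prev_lambda, rewards):
    """EMA-smoothed RL weight, capped so SFT always keeps at least 0.2."""
    target = LAMBDA_MAX * direction_gate(rewards) * information_gate(rewards)
    smoothed = EMA_ALPHA * prev_lambda + (1.0 - EMA_ALPHA) * target
    return min(smoothed, LAMBDA_MAX)


def hybrid_loss(lam, sft_loss, grpo_text_loss):
    """L_hybrid = (1 - lambda_t) * L_SFT + lambda_t * L_GRPO^(T)."""
    return (1.0 - lam) * sft_loss + lam * grpo_text_loss
```

With these choices, a rollout group in which every reward is identical, or in which no reward exceeds 3, pulls λ_t toward zero, so the step falls back toward pure SFT.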

Empirical Observations

Observation 1: SFT yields larger, coherent distribution shifts across the full sequence, while stabilized preference-optimization (PO) and RL updates are smaller and localized under trust-region constraints.

Observation 2: Reward-model judgments agree with humans more strongly on semantic quality than on acoustic quality, where agreement is weaker and more variable.

Observation 3: Preference gradients concentrate on semantics; full-token PO yields low-SNR, high-variance updates on dense acoustics with near-zero cross-modal coupling.

Observation 4: Rollout discriminability is uneven and stage-dependent. Diversity and reward variance are often weakest along the acoustic dimension, which favors adaptive gating over fixed mixing.
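
Observations 3 and 4 motivate restricting preference gradients to text tokens. The sketch below illustrates that restriction with a simplified group-relative policy-gradient surrogate: rewards are standardized within a rollout group, and only positions flagged as text contribute to the loss. It is an illustration only. It deliberately omits the clipped importance ratio and KL penalty that a full GRPO objective would include, and the function names, the `is_text` mask layout, and the per-token averaging are assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def text_token_pg_loss(logprobs, is_text, rewards):
    """Simplified preference loss restricted to text tokens.

    logprobs[i][t]: log-probability of token t in rollout i (current policy).
    is_text[i][t]:  True for text tokens, False for speech tokens.
    rewards[i]:     scalar sequence-level reward for rollout i.
    Speech tokens are skipped entirely, so they receive no preference gradient.
    """
    advantages = group_relative_advantages(rewards)
    total, count = 0.0, 0
    for lp_seq, mask_seq, adv in zip(logprobs, is_text, advantages):
        for lp, text in zip(lp_seq, mask_seq):
            if text:
                total += -adv * lp
                count += 1
    return total / max(count, 1)


# Two rollouts, each with one text token followed by one speech token.
logprobs = [[-1.0, -2.0], [-0.5, -3.0]]
is_text = [[True, False], [True, False]]
rewards = [5.0, 1.0]
loss = text_token_pg_loss(logprobs, is_text, rewards)

# Changing only the speech-token log-probabilities leaves the loss unchanged.
perturbed = [[-1.0, -9.0], [-0.5, -0.1]]
assert text_token_pg_loss(perturbed, is_text, rewards) == loss
```

In a real training loop the same mask would be applied to the per-token loss tensor before reduction, while the SFT term continues to cover the speech tokens.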

Experimental Results

Intelligence (IQ) — VoiceBench & OpenAudioBench

Our dynamic hybrid method achieves the strongest overall IQ among all compared methods on both the VITA (interleaved) and KimiAudio (parallel) architectures: it is best or tied for best on 11 of the 14 columns for each architecture. Restricting preference updates to text tokens yields more reliable IQ gains than full-token optimization.

Method	VoiceBench									OpenAudioBench
Method	Alpaca	Common	Wild	SD-QA	MMSU	OBQA	BBH	IFEval	Adv	Alpaca	Llama	Reason	Trivial	Web
VITA Architecture (Interleaved)
VITA-Base	3.83	3.44	3.09	29.2	48.7	74.3	58.2	26.2	94.1	60.6	73.8	44.2	42.9	53.5
SFT	3.45	3.12	2.85	27.6	45.1	71.5	54.9	28.4	99.2	55.6	71.1	38.4	39.9	48.3
Full-Token DPO	3.60	3.29	2.89	30.2	44.7	69.2	56.8	22.6	65.0	20.1	55.4	33.4	29.8	36.6
Text-Token DPO	3.91	3.32	3.13	31.1	45.6	69.7	60.3	32.8	71.3	57.2	74.3	43.1	43.1	54.3
Full-Token RL	4.03	3.45	3.19	29.9	49.0	74.1	55.6	29.4	96.3	63.8	73.3	43.7	43.3	52.8
Text-Token RL	4.09	3.44	3.20	31.3	50.0	75.4	56.7	30.2	96.3	64.6	74.6	44.4	44.4	53.2
SFT + RL (Two-Stage)	3.49	3.32	2.69	22.5	44.7	70.8	54.2	25.5	98.8	54.0	66.2	32.8	35.1	49.0
Ours (Dynamic)	4.22	3.51	3.29	31.5	51.4	77.1	59.9	32.5	97.1	68.4	74.6	46.1	44.4	54.7
KimiAudio Architecture (Parallel)
KimiAudio-Base	4.46	3.97	3.42	63.1	62.2	83.5	64.2	61.1	100.0	75.7	79.3	58.0	62.1	70.2
SFT	4.15	3.65	3.10	59.8	58.4	79.5	61.2	64.5	100.0	71.4	75.2	52.8	58.4	66.9
Full-Token DPO	4.05	3.60	3.05	58.2	55.1	76.8	59.4	58.4	88.5	68.2	70.4	50.1	55.3	65.1
Full-Token RL	4.52	4.05	3.50	65.2	63.8	84.6	64.8	62.8	100.0	75.8	78.5	58.8	61.2	71.5
Ours (Dynamic)	4.58	4.22	3.68	67.9	66.5	87.1	68.3	66.8	99.5	78.5	81.2	61.5	61.8	71.1

Expressiveness (EQ) — VStyle

On VStyle, SFT is competitive for style instruction following and narrowly leads the Instruction column on both architectures, whereas naive full-token DPO degrades severely on every dimension. Our method scores highest on Acoustic, Role Play, and Empathy on both architectures, giving the best aggregate EQ and a better IQ–EQ Pareto trade-off.

Method	Acoustic	Instruction	Role Play	Empathy
VITA Architecture
VITA-Base	2.26	1.76	2.15	4.01
SFT	2.34	2.29	2.31	3.42
Full-Token DPO	1.49	1.25	1.10	1.05
Text-Token DPO	2.03	1.64	2.19	4.38
Full-Token RL	2.16	1.64	1.97	3.95
Text-Token RL	2.21	1.93	2.08	4.02
Ours (Dynamic)	2.55	2.25	2.41	4.44
KimiAudio Architecture
KimiAudio-Base	2.53	2.31	1.73	3.67
SFT	2.65	2.58	1.95	3.65
Full-Token DPO	1.85	1.55	1.30	2.10
Full-Token RL	2.58	2.25	1.88	3.88
Ours (Dynamic)	2.78	2.52	2.15	4.15

Ablation: Weighting Schemes & Optimization Scope

Scope	Strategy	IQ	EQ
All Tokens	Fixed (0.5/0.5)	48.70	2.48
Text Tokens	Fixed (0.5/0.5)	52.60	2.60
Text Tokens	Fixed (0.7 SFT / 0.3 RL)	49.94	2.72
All Tokens	Dynamic Weights	48.84	2.50
Text Tokens	Dynamic w/o EMA	53.15	2.53
Text Tokens	Ours (Dynamic)	55.24	2.92

Human Subjective Evaluation

Side-by-side human study comparing our model with the original base model on VITA-Audio. Our model is significantly preferred in helpfulness, naturalness, and overall quality, with a win-to-loss ratio above 3:1 on every dimension (about 3.9:1 overall).

Dimension	Win (%)	Tie (%)	Loss (%)	p-value
Helpfulness	62.5	17.5	20.0	0.0046
Naturalness	65.0	15.0	20.0	0.0029
Overall	67.5	15.0	17.5	< 0.001
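
As a sanity check on the table above, the p-values can be approximated with an exact two-sided sign test that discards ties. The page states neither the test used nor the number of comparisons, so both are assumptions here: the sketch assumes 40 comparisons per dimension (the percentages are all multiples of 2.5%) and a binomial null of p = 0.5. Under those assumptions the computed values come out at roughly 0.0046, 0.0029, and below 0.001, in line with the reported column.

```python
from math import comb

N_COMPARISONS = 40  # assumption: not stated on the page


def sign_test_two_sided(wins, losses):
    """Exact two-sided sign test on wins vs. losses, with ties discarded."""
    n = wins + losses
    k = max(wins, losses)
    # Upper-tail probability of seeing k or more successes under p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# (win %, loss %) copied from the table; ties are the remainder.
rows = {
    "Helpfulness": (62.5, 20.0),
    "Naturalness": (65.0, 20.0),
    "Overall": (67.5, 17.5),
}

results = {}
for name, (win_pct, loss_pct) in rows.items():
    wins = round(win_pct * N_COMPARISONS / 100)
    losses = round(loss_pct * N_COMPARISONS / 100)
    results[name] = (wins, losses, sign_test_two_sided(wins, losses))
    print(f"{name}: {wins} wins, {losses} losses, "
          f"ratio {wins / losses:.2f}:1, p = {results[name][2]:.4f}")
```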

BibTeX

@inproceedings{chen2026wavalign,
  title={WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training},
  author={Chen, Yifu and Ji, Shengpeng and Chen, Qian and Liang, Tianle and Li, Yangzhuo and Wang, Ziqing and Wang, Wen and Lu, Jingyu and Wang, Haoxiao and Pu, Xueyi and Zhuo, Fan and Zhao, Zhou},
  booktitle={Proceedings of ACL},
  year={2026}
}