Advancing Audio-Based Text Generation with Imbalance Preference Optimization

3 citations · #1222 of 3028 papers in AAAI 2025

Abstract

Human feedback in generative systems is a highly active frontier of research that aims to improve the quality of generated content and align it with subjective preferences. Existing efforts predominantly focus on text-only large language models (LLMs) or text-based image generation, while cross-modal generation between audio and text remains largely unexplored. Moreover, there is currently no open-source preference dataset to support the deployment of alignment algorithms in this domain. In this work, we take audio speech translation (AST) and audio captioning (AAC) as example tasks to explore how to enhance the performance of mainstream audio-based text generation models with limited human annotation. Specifically, we propose a novel framework named IPO built on a model-adversarial sampling scheme: human annotators act as referees to determine the outcome between two models, and these verdicts serve as pseudo-labels for the corresponding beam-search hypotheses. Given these imbalanced win-loss results, IPO enables the two models to update interactively, each aiming to win the next round of adversarial sampling. We conduct both subjective and objective evaluations to demonstrate the alignment benefits of IPO and its improvements to model perception and generation capacities. On both AAC and AST, a few hundred annotations significantly enhance the weaker model, and the stronger model is also encouraged to achieve new state-of-the-art results on different objective metrics. Additionally, we show the extensibility of IPO by applying it to the reverse task of text-to-speech generation, improving the robustness of the system on unseen reference speakers.
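As a rough illustration of the adversarial-sampling loop described in the abstract, the following is a minimal Python sketch of one round. All names (`generate_beams`-style callables, `human_referee`, `PreferencePair`) are hypothetical placeholders rather than the authors' actual API, and the all-pairs propagation of the referee's verdict to the full beams is an assumption; the abstract does not specify the exact pairing or update rule.

```python
# Minimal sketch of one IPO adversarial-sampling round, based only on the
# abstract. The propagation of a single human verdict to all beam-search
# hypotheses (all-pairs pseudo-labeling) is an assumption, not a confirmed
# detail of the authors' method.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str    # input audio, represented here by an id/path
    chosen: str    # hypothesis from the round's winning model
    rejected: str  # hypothesis from the round's losing model

def adversarial_round(
    audio_batch: List[str],
    model_a: Callable[[str], List[str]],  # returns beam-search hypotheses
    model_b: Callable[[str], List[str]],
    human_referee: Callable[[str, str, str], int],  # 0 -> A wins, 1 -> B wins
) -> List[PreferencePair]:
    """Collect imbalanced win-loss pseudo-labels from one adversarial round.

    The annotator judges only the two models' top hypotheses; the verdict
    is then propagated as a pseudo-label to both models' full beams.
    """
    pairs: List[PreferencePair] = []
    for audio in audio_batch:
        beams_a, beams_b = model_a(audio), model_b(audio)
        winner = human_referee(audio, beams_a[0], beams_b[0])
        chosen_beams, rejected_beams = (
            (beams_a, beams_b) if winner == 0 else (beams_b, beams_a)
        )
        # Imbalanced pairing: every winning hypothesis is preferred over
        # every losing one, so a few hundred verdicts yield many pairs.
        for c in chosen_beams:
            for r in rejected_beams:
                pairs.append(PreferencePair(audio, c, r))
    return pairs
```

The resulting pairs could then feed a standard pairwise preference update (e.g. a DPO-style loss) for both models before the next round, which would match the abstract's description of the models updating interactively to win subsequent rounds.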

Citation History

Jan 27, 2026: 0
Feb 4, 2026: 3 (+3)