Publication

AISTATS 2026

Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration
 
Oh, Youngmin, et al. "Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration." AISTATS, 2026.
RLHF is essential for aligning AI with human intent because it transforms the subjective task of evaluation into a series of pairwise "duels" between model outputs, a framework known as the Contextual Dueling Bandit problem. This approach is fundamentally more reliable than absolute scoring, as humans are naturally better at making relative comparisons than at assigning consistent numerical ratings.

Our research introduces NVLDB, which bridges this alignment process with deep learning by using neural networks to approximate complex, non-linear human preferences, while proving that the cumulative error, or regret, remains sublinear under standard assumptions. Unlike previous neural dueling bandit methods, which incur heavy computational overhead by using the gradients of all learnable parameters and require an impractical network width on the order of T^14, our approach adopts a "shallow exploration" strategy that relies only on the final layer's gradients. This significantly improves computational efficiency, reduces the required network width to a more realistic T^6, and introduces variance-awareness to better handle the inherent noise in human preference feedback. A minimal sketch of the shallow-exploration idea is given below.
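The snippet below is an illustrative sketch of shallow exploration, not the paper's exact algorithm: the exploration bonus for a duel is computed only from the gradients of the final linear layer, which for a linear head coincide with the penultimate-layer activations. The names PreferenceNet, duel_bonus, and beta are hypothetical, and the variance-aware reweighting of past feedback is omitted.

```python
import torch
import torch.nn as nn

class PreferenceNet(nn.Module):
    """Small scoring network: an MLP body plus a linear head. Because the head
    is linear, the gradient of the score w.r.t. the head weights is exactly the
    penultimate-layer activation, which is the only feature used for exploration."""
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.head = nn.Linear(width, 1, bias=False)

    def features(self, x: torch.Tensor) -> torch.Tensor:
        # "Shallow" feature map: the last-layer gradient of the score.
        return self.body(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x)).squeeze(-1)


def duel_bonus(net: PreferenceNet, x_i, x_j, cov_inv, beta: float = 1.0):
    """Elliptical confidence width for the duel (i, j), restricted to the
    shallow feature map: beta * || phi(x_i) - phi(x_j) ||_{cov_inv}.
    A variance-aware variant would additionally reweight the design matrix
    by the estimated noise of each past comparison (omitted here)."""
    with torch.no_grad():
        diff = net.features(x_i) - net.features(x_j)
        return beta * torch.sqrt(diff @ cov_inv @ diff)


# Toy usage: optimistically compare two candidate responses.
dim, width = 8, 64
net = PreferenceNet(dim, width)
cov = torch.eye(width)                  # regularized design matrix over shallow features
x_a, x_b = torch.randn(dim), torch.randn(dim)
margin = net(x_a) - net(x_b)            # estimated preference margin of a over b
bonus = duel_bonus(net, x_a, x_b, torch.linalg.inv(cov))
print(float(margin + bonus))            # optimistic estimate that a is preferred to b
```

Because only the width-dimensional last-layer features enter the covariance matrix, the per-round cost of the exploration bonus stays independent of the total number of network parameters, which is the source of the efficiency gain described above.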