Publications
- "Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration." AISTATS, 2026.
- "Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions." ICML, 2026.
"Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions"
We study linear dueling bandits in volatile environments where post-serving contexts, delayed feedback, and adversarial corruption occur simultaneously. Feedback is subject to unknown stochastic or adversarial delays and to a cumulative corruption budget C. To address these challenges, we propose RCDP-UCB, which integrates a learned approximator that predicts post-serving contexts from pre-serving information. It further employs an adaptive weighting strategy that clips feature vectors to mitigate the impact of corrupted and delayed observations at the same time. Under standard regularity conditions and a parametric post-serving mapping, we show that the algorithm is delay-regime-agnostic and achieves a regret upper bound of Õ(d(√T + C + D)), where d is the total feature dimension and D captures the delay complexity. Crucially, the analysis reveals an additive cost structure between corruption and delay, avoiding the multiplicative degradation typical of prior work. We also establish lower bounds that nearly match the upper bounds, up to a √d factor, for adversarial delays in the absence of post-serving contexts.
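The adaptive weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact rule used by RCDP-UCB, so the code below assumes a generic uncertainty-based weight, min(1, alpha / ||z||), of the kind commonly used in corruption-robust linear bandits, where z is the feature difference of a duel and the norm is taken with respect to the inverse covariance. The function names, the threshold `alpha`, and the identity regularizer are all hypothetical, and neither delays nor post-serving context prediction are modeled.

```python
import numpy as np

def clipped_weight(z, cov_inv, alpha):
    """Assumed rule: weight = min(1, alpha / ||z||_{cov_inv})."""
    norm = float(np.sqrt(z @ cov_inv @ z))
    return 1.0 if norm <= alpha else alpha / norm

def weighted_update(cov, z, weight):
    """Add the weighted outer product of the feature difference to the covariance."""
    return cov + weight * np.outer(z, z)

d = 3
cov = np.eye(d)   # identity regularizer (assumption)
alpha = 0.5       # hypothetical clipping threshold

# z stands for the feature difference phi(a1) - phi(a2) of one duel.
z_large = np.array([2.0, 0.0, 0.0])   # high-uncertainty direction
z_small = np.array([0.1, 0.0, 0.0])   # low-uncertainty direction

cov_inv = np.linalg.inv(cov)
w_large = clipped_weight(z_large, cov_inv, alpha)   # norm 2.0 > alpha, so down-weighted
w_small = clipped_weight(z_small, cov_inv, alpha)   # norm 0.1 <= alpha, so weight stays 1

cov = weighted_update(cov, z_large, w_large)
```

Under this assumed rule, an observation's influence, weight times norm, never exceeds `alpha`, which limits how much any single corrupted duel can shift the estimate.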
"Neural Variance-aware Dueling Bandits with Deep Representation and Shallow Exploration"
Reinforcement learning from human feedback (RLHF) is essential for aligning AI with human intent: it turns the subjective task of evaluation into a series of pairwise "duels" between model outputs, a framework known as the contextual dueling bandit problem. This approach is more reliable than absolute scoring, because humans are better at making relative comparisons than at assigning consistent numerical ratings. Our work introduces NVLDB, which uses neural networks to approximate complex, non-linear human preferences while guaranteeing that the cumulative regret remains sublinear under standard assumptions. Previous neural dueling bandit methods incurred heavy computational overhead by using the gradients of all learnable parameters, and required an impractical network width of T^14. Our approach instead uses a "shallow exploration" strategy that relies only on the final layer's gradients. This improves computational efficiency, reduces the required network width to a more realistic T^6, and introduces variance-awareness to better handle the inherent noise in human preference feedback.
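To make "shallow exploration" concrete, here is a minimal sketch under simplifying assumptions. It is not the NVLDB algorithm: it uses a hypothetical two-layer network with random weights, omits training and the variance-aware weighting, and treats `beta` and the identity covariance as placeholders. It only shows that when the score is linear in the last-layer weights, the gradient with respect to those weights equals the hidden representation, so the exploration bonus for a duel needs a covariance matrix of size width by width rather than one over all parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network: hidden = relu(W1 @ x), score = theta @ hidden.
d_in, width = 5, 8
W1 = rng.normal(size=(width, d_in)) / np.sqrt(d_in)
theta = rng.normal(size=width) / np.sqrt(width)

def last_layer_features(x):
    """Hidden representation; equals the gradient of the score w.r.t. theta."""
    return np.maximum(W1 @ x, 0.0)

def score(x):
    """Scalar utility estimate for one arm."""
    return float(theta @ last_layer_features(x))

def pair_bonus(x1, x2, cov_inv, beta):
    """Exploration bonus for a duel, using only the last-layer feature difference."""
    z = last_layer_features(x1) - last_layer_features(x2)
    return beta * float(np.sqrt(z @ cov_inv @ z))

cov = np.eye(width)                   # width x width covariance over last-layer features
total_params = W1.size + theta.size   # a full-gradient method would need total_params x total_params

x1, x2 = rng.normal(size=d_in), rng.normal(size=d_in)
bonus = pair_bonus(x1, x2, np.linalg.inv(cov), beta=1.0)
```

In this toy setting the covariance has 8 x 8 entries instead of 48 x 48, and the gap grows quickly with depth and width, which is the source of the efficiency gain the abstract describes.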
