Reduced-Variance Payoff Estimation in
Adversarial Bandit Problems
Levente Kocsis and Csaba Szepesvári
Computer and Automation Research Institute of the
Hungarian Academy of Sciences, Kende u. 13-17, 1111 Budapest, Hungary
kocsis@sztaki.hu
Abstract. A natural way to compare learning methods in non-
stationary environments is to compare their regret. In this paper
we consider the regret of algorithms in adversarial multi-armed bandit
problems. We propose several methods to improve the performance of
the baseline exponentially weighted average forecaster by changing the
payoff-estimation methods. We argue that improved performance can
be achieved by constructing payoff estimation methods that produce
estimates with low variance. Our arguments are backed up by both
theoretical and empirical results. In fact, our empirical results show that
significant performance gains are possible over the baseline algorithm.
1 Introduction
Regret is the excess cost incurred by a learner due to the lack of knowledge
of the optimal solution. Since the notion of regret makes no assumptions on
the environment, comparing algorithms by their regret represents an appealing
choice for studying learning in non-stationary environments.
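As an informal illustration (with notation introduced only for this remark, not necessarily that used later in the paper): if the learner picks choice $I_t$ in round $t$, $x_{i,t}$ denotes the reward that choice $i$ would have earned in that round, and there are $N$ choices, then the regret after $n$ rounds is commonly formalized as
$$ R_n = \max_{1 \le i \le N} \sum_{t=1}^{n} x_{i,t} \;-\; \sum_{t=1}^{n} x_{I_t,t}, $$
i.e., the difference between the cumulative reward of the best fixed choice in hindsight and the cumulative reward actually collected by the learner.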
In this paper our focus is a slightly extended version of the adversarial bandit
problem, originally proposed by Auer et al. [1]. The model that we start from is
best described as a repeated game against an adversary using expert advice. In
each round a player must choose an expert from a finite set of experts. In the
given round the selected expert advises the player in playing a game against the
adversary. At the end of the round the reward associated with the outcome of
the game is communicated to the player. The player’s goal is to maximize his
total reward over the sequence of trials. Of course, the total reward depends on
how strong the individual experts are and, hence, a more reasonable criterion is
to minimize the learner's loss relative to the total reward of the best expert, i.e.,
the regret. If all t