basque center for applied mathematics

Hemen zaude:

1. Ikerketa
2. Simulazioak
3. Bandit problems

## Ikerketa esparruak

### Bandit problems

Peter Jacko

Jacko (2009) unifies into a comprehensive framework several well-known problems in combinatorial optimization, management, stochastic scheduling, and queueing theory.
The following paragraphs are dedicated to outline a couple of special cases of the stochastic and dynamic resource allocation problem known as the bandit problems.

Suppose that there are several competitors (or projects, or bandits) evolving in a stochastic manner, which simultaneously demand or service from a server that is able to serve only one competitor at time. We can decide to which competitor the server is allocated at any given moment. The competitors evolution yields certain aggregate reward, i.e., a sum of individual customer rewards that only depend on the current state of the competitor and whether it is currently being served or not. The objective is to allocate the server in such a way that the aggregate reward is maximal, either in discounted expectation or on time-average.

In a more general variant, we may consider a capacity of a resource divisible into a finite number of units and state-dependent (discrete) capacity demands from the competitors, instead of the server able to serve strictly one competitor at time.
Since this problem is PSPACE-hard to solve optimally, we are typically interested in an approximate solution which is easy to implement in practice. However, for some special cases it is possible to characterize an optimal solution.

Note that the problem when the resource and the competitor demands (seen as goods and services) are continuously-divisible is the fundamental one in neoclassical economics build up from marginalism. It should therefore not be surprising that marginal concepts also appear in the (approximate) solution to the discretely-divisible problem considered here.

### Multi-Armed Bandit Problem (Classic Bandits)

Imagine a gambler in casino faced with a (fixed) number of one-armed bandit machines (or slot machines). Suppose that she is there alone and forever, and every minute she can and must choose one of the machines to play (to pull the arm). The played machine randomly evolves and yields a payoff according to its current state. The objective of the gambler is to maximize the expected discounted payoff obtained over an infinite horizon.

Imagine a physician faced with a flow of patients with a new disease. Suppose that every day she can and must choose one of the available treatments to apply to the patient coming that day. The treated patient reacts to the treatment and it is immediately known whether the treatment was successful (healing the patient) or not (dying). The objective of the physician is to heal as many patients as possible, i.e., to maximize the expected discounted number of healed patients over an infinite horizon.

The above two real-life problems lead to the same mathematical model of learning known as the multi-armed bandit problem. Its main distinguishing feature is the freezing property: the states of non-played machines (or non-used treatments) do not change. In fact, in the first case, the gambler is to learn which machine is the best one on average, and to exploit all the machines in favorable states during the process of learning. Similarly in the second case, the physician is to learn which treatment is the best on average, and to exploit all the other treatments in favorable states during the process of learning.

Gittins and his colleagues in the early 1970s showed that there are certain state-dependent prices (now called the Gittins index values) that can be calculated for each bandit independently of the others, and such that the problem can be optimally solved by an index rule. The index rule prescribes to index (or sort) at every time period the bandits according to their prices (associated to their current states) and to play anyone of those with the highest price (colored in green in the video).

The video shows evolution of the prices (or the Gittins index values) in the multi-armed bandit problem. Recall the freezing property: the prices of the non-played bandits do not change, since the states of such bandits do not change.

### Restless Bandits

In the problem of restless bandits the freezing property is dropped. This model was introduced by Whittle (1988), where he proposed to calculate the price (sometimes called the Whittle index value) that generalizes the Gittins index value to this case. However, the Whittle index values do not necessarily exists, and further, the index rule is in general non-optimal. Nevertheless, a growing body of literature shows that for many real-life problems the existence of such prices can be established and that the index rule performs, on average, very close to optimality.

Back to list