
Multi-Armed Bandit

Solutions to the multi-armed bandit problem.

Introduction

The multi-armed bandit problem is a well-studied problem in the area of reinforcement learning. Its name comes from the term "one-armed bandit" for a slot machine. The problem involves an agent and a set of $K$ levers. On each round, the agent must choose one lever to pull. Each lever may have a different payout amount and distribution, and the goal of the agent is to maximize its total reward over time.

The key to the problem is finding a balance between exploration and exploitation. An agent needs to pull the best lever available to it as many times as possible in order to maximize reward, but the agent must explore the available levers in order to determine which is best.

Notation and Definitions

An agent is an entity that exists in an environment and can make decisions and take actions within that environment. In these examples, time progresses discretely in the environment. For each time $t$ the state of the environment is $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the state space. In each round (each timestep), the agent can take an action $A_t \in \mathcal{A}$, where $\mathcal{A}$ is the action space. Taking an action results in a reward $R_t \in \mathcal{R}$, where $\mathcal{R}$ is the reward space.

In this project, an action is the choice of the lever to pull. The variable $K$ typically denotes the number of levers. Taking an action results in a reward.

The true value of an action $a$ is denoted by $q_*(a)$, and the value of the action as estimated by the agent at time $t$ is $Q_t(a)$.

One simple way to estimate the value of an action is to calculate the sample average of its rewards. If an action $a$ has been taken $n$ times, resulting in the rewards $R_1, R_2, \ldots, R_n$, the agent can use the formula:

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_n}{n}$$
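
As a minimal sketch (in Go, the repository's primary language, but not taken from the repository's code), the sample average can be maintained incrementally so the agent never has to store every past reward. The `estimator` type and its field names are hypothetical.

```go
package main

import "fmt"

// estimator keeps a running sample-average value estimate for a single
// action without storing every reward (hypothetical helper, not the
// repository's API).
type estimator struct {
	count int
	value float64
}

// update folds a new reward into the sample average incrementally:
// Q_new = Q_old + (R - Q_old) / n.
func (e *estimator) update(reward float64) {
	e.count++
	e.value += (reward - e.value) / float64(e.count)
}

func main() {
	var e estimator
	for _, r := range []float64{1.0, 0.0, 2.0, 1.0} {
		e.update(r)
	}
	// Prints the same result as averaging all four rewards directly: 1.00.
	fmt.Printf("estimated value after %d pulls: %.2f\n", e.count, e.value)
}
```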

By estimating the value of each lever at a given time, the agent can identify the lever with the highest estimated value. The action of pulling this lever is called the greedy action. The action with the greatest true value is denoted by $a^*$.

A policy defines an agent's behavior. On each timestep $t$ an agent uses its policy to examine the state of the environment $S_t$ and take an action $A_t$, resulting in a reward $R_t$.
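
As a sketch only, such a policy could be expressed in Go as an interface; the `Agent` name and method signatures below are assumptions for illustration, not the types defined in this repository.

```go
package bandit

// Agent is a hypothetical interface sketching how a bandit policy might be
// expressed in Go; the actual types in this repository may differ.
type Agent interface {
	// SelectAction chooses which of the K levers to pull on this timestep.
	SelectAction() int
	// Update informs the agent of the reward received for the chosen action,
	// so it can revise its value estimates.
	Update(action int, reward float64)
}
```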

Formally, the goal of the problem is to maximize the cumulative reward function:

$$\sum_{t=1}^{T} R_t$$

Solutions

Epsilon-Greedy Strategy

The lever with the highest current value estimate is selected for a proportion 1 - ε of the trials, and a lever is selected at random (with uniform probability) for a proportion ε. A typical parameter value might be ε = 0.1, but this can vary widely depending on circumstances and predilections.
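
A minimal Go sketch of this selection rule, assuming the agent keeps a slice of per-lever value estimates; the function and parameter names are illustrative, not the repository's API.

```go
package main

import (
	"fmt"
	"math/rand"
)

// epsilonGreedy picks the index of the highest value estimate with
// probability 1-epsilon, and a uniformly random lever with probability
// epsilon. (Illustrative sketch; names are assumptions.)
func epsilonGreedy(estimates []float64, epsilon float64, rng *rand.Rand) int {
	if rng.Float64() < epsilon {
		return rng.Intn(len(estimates)) // explore
	}
	best := 0
	for i, v := range estimates {
		if v > estimates[best] {
			best = i
		}
	}
	return best // exploit the greedy action
}

func main() {
	rng := rand.New(rand.NewSource(1))
	estimates := []float64{0.2, 0.8, 0.5}
	fmt.Println(epsilonGreedy(estimates, 0.1, rng))
}
```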

Epsilon-First Strategy

A pure exploration phase is followed by a pure exploitation phase. For N trials in total, the exploration phase occupies εN trials and the exploitation phase (1-ε)N trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
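
A sketch of the two-phase rule, again assuming a slice of value estimates and a known total number of trials; all names are illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
)

// epsilonFirst explores uniformly at random for the first epsilon*N trials
// and exploits the greedy action for the rest. trial is the zero-based
// index of the current round out of totalTrials. (Illustrative sketch.)
func epsilonFirst(estimates []float64, trial, totalTrials int, epsilon float64, rng *rand.Rand) int {
	if float64(trial) < epsilon*float64(totalTrials) {
		return rng.Intn(len(estimates)) // pure exploration phase
	}
	best := 0
	for i, v := range estimates {
		if v > estimates[best] {
			best = i
		}
	}
	return best // pure exploitation phase
}

func main() {
	rng := rand.New(rand.NewSource(1))
	estimates := []float64{0.2, 0.8, 0.5}
	for trial := 0; trial < 10; trial++ {
		fmt.Println(epsilonFirst(estimates, trial, 10, 0.3, rng))
	}
}
```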

Epsilon-Decreasing Strategy

Similar to the epsilon-greedy strategy, except that the value of ε decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
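
A sketch combining the epsilon-greedy rule with one possible decay schedule. The 1/(1 + t) schedule is an assumption for illustration only; the strategy merely requires that ε shrinks over time.

```go
package main

import (
	"fmt"
	"math/rand"
)

// decayingEpsilon returns an exploration rate that shrinks as more trials
// elapse. The 1/(1+t) schedule is a hypothetical choice, not the
// repository's schedule.
func decayingEpsilon(initial float64, trial int) float64 {
	return initial / (1.0 + float64(trial))
}

// epsilonDecreasing behaves like epsilon-greedy but recomputes epsilon on
// every round from the decay schedule.
func epsilonDecreasing(estimates []float64, initial float64, trial int, rng *rand.Rand) int {
	if rng.Float64() < decayingEpsilon(initial, trial) {
		return rng.Intn(len(estimates)) // explore, mostly in early rounds
	}
	best := 0
	for i, v := range estimates {
		if v > estimates[best] {
			best = i
		}
	}
	return best
}

func main() {
	rng := rand.New(rand.NewSource(1))
	estimates := []float64{0.2, 0.8, 0.5}
	for trial := 0; trial < 5; trial++ {
		fmt.Printf("trial %d: epsilon=%.2f, action=%d\n",
			trial, decayingEpsilon(1.0, trial), epsilonDecreasing(estimates, 1.0, trial, rng))
	}
}
```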

Softmax Boltzmann Strategy

Select actions according to probabilities, rather than always selecting either the best action or a random action (as in the epsilon strategies above). On each round, calculate a probability of selection for each possible action (using the action's value estimate), and then select an action according to those probabilities. This description applies to all softmax strategies; the difference between softmax strategies is how the probabilities are calculated. In the Boltzmann case, the probability of selecting action $a$ is:

$$P(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{K} e^{Q_t(b)/\tau}}$$

where $\tau$ is a temperature parameter: high temperatures make the actions nearly equiprobable, while low temperatures concentrate the probability on the greedy action.
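
A sketch of the Boltzmann probabilities and a sampler that draws an action from them; the temperature and estimate values in `main` are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// boltzmann returns the selection probability for each action given its
// value estimate and a temperature tau. (Illustrative sketch, not the
// repository's implementation.)
func boltzmann(estimates []float64, tau float64) []float64 {
	probs := make([]float64, len(estimates))
	var sum float64
	for i, q := range estimates {
		probs[i] = math.Exp(q / tau)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

// sample draws an action index according to the given probabilities.
func sample(probs []float64, rng *rand.Rand) int {
	r := rng.Float64()
	var cum float64
	for i, p := range probs {
		cum += p
		if r < cum {
			return i
		}
	}
	return len(probs) - 1 // guard against floating-point rounding
}

func main() {
	rng := rand.New(rand.NewSource(1))
	estimates := []float64{0.2, 0.8, 0.5}
	probs := boltzmann(estimates, 0.5)
	fmt.Println(probs, sample(probs, rng))
}
```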

Value-Difference Based Exploration (VDBE) Strategy

The Value-Difference Based Exploration strategy builds upon the epsilon-greedy strategy and, like the epsilon-decreasing strategy, updates the value of epsilon over time. In the VDBE strategy, epsilon decreases as the agent becomes more sure of its environment and increases if the agent finds that its understanding of the environment is wrong. This is measured by taking the difference between the estimated value of an action and the reward received from taking that action (a quantity called the temporal-difference error). For large temporal-difference errors the value of epsilon is increased, and for small errors it is decreased. See [2] for details.
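
A sketch of an epsilon update driven by the absolute temporal-difference error, following the shape of the update described in [2]; the sigma (sensitivity) and delta (mixing weight) values below are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// vdbeEpsilon nudges epsilon toward a Boltzmann function of the absolute
// temporal-difference error: large errors push epsilon up (explore more),
// small errors let it decay (exploit more). Sketch of the idea in Tokic [2];
// sigma and delta are tuning parameters and their values here are assumed.
func vdbeEpsilon(epsilon, tdError, sigma, delta float64) float64 {
	f := (1 - math.Exp(-math.Abs(tdError)/sigma)) /
		(1 + math.Exp(-math.Abs(tdError)/sigma))
	return delta*f + (1-delta)*epsilon
}

func main() {
	epsilon := 0.5
	// A surprising reward (large TD error) raises epsilon; repeated accurate
	// predictions (TD error near zero) shrink it toward zero.
	for _, tdError := range []float64{2.0, 0.1, 0.0, 0.0} {
		epsilon = vdbeEpsilon(epsilon, tdError, 1.0, 1.0/3.0)
		fmt.Printf("td error %.1f -> epsilon %.3f\n", tdError, epsilon)
	}
}
```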

Sources

  1. Sutton, Richard S. & Barto, Andrew G. (1998). Reinforcement learning: an introduction. Cambridge, MA: The MIT Press.
  2. Tokic M. (2010) Adaptive ε-Greedy Exploration in Reinforcement Learning Based on Value Differences. In: Dillmann R., Beyerer J., Hanebeck U.D., Schultz T. (eds) KI 2010: Advances in Artificial Intelligence. KI 2010. Lecture Notes in Computer Science, vol 6359. Springer, Berlin, Heidelberg