# An Introduction to Multi-Armed Bandits and Markov Decision Processes

I took the course Foundations of Intelligent and Learning Agents, offered by Prof. Shivaram Kalyanakrishnan at IIT Bombay, in Autumn 2019. I found the course content very exciting and decided to share some basic intuition and ideas from multi-armed bandits and Markov decision processes in this blog post.

Consider an intelligent agent $M$ in an environment $E$ where it can take possible actions $A = \{a_1, \cdots, a_n\}$. At each discrete time step $t$, the agent will get a reward $r_t$ and it will move in the state space from its previous state $s_t$ to $s_{t+1}$. The objective of the agent is to maximize its cumulative reward in the long run i.e. $\sum_{t=1}^{\infty} r_t$.

To do this, the agent needs to do two things:

1. Exploration: Explore $E$ to ensure that it has a good model of the environment it is in by taking different actions in different states.
2. Sequential Decision Making: Take its previous states into account to take actions based on the model of $E$ that it has formed.

Reinforcement learning is a combination of exploration and sequential decision making. We study it in three steps:

1. Multi-armed Bandits (MABs): the idea of exploration and exploitation (without accounting for states)
2. Markov Decision Processes (MDPs): the idea of states and transitions between states
3. Reinforcement Learning (RL): combining the two ideas and designing algorithms that explore and learn the environment while accounting for previous states

## Multi-armed Bandits

Consider 3 coins with unknown biases (the probabilities of getting heads). You are an agent who has to toss a coin once every minute for the next 100 minutes. Your objective is to maximize the number of heads. What will you do? Ideally, you would want to keep tossing the coin with the maximum probability of heads, but the biases are not known and need to be estimated. So you need to explore the coins to estimate their probabilities and then exploit the most promising coin. One way would be to toss each coin 10 times, calculate the estimates of the probabilities, and then keep tossing the coin with the highest estimate for the remaining 70 time steps. But is this the optimal way?
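Here is a minimal sketch of this explore-then-exploit strategy; the coin biases, horizon, and exploration budget are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def explore_then_commit(biases, horizon=100, explore_per_arm=10):
    """Toss each coin `explore_per_arm` times, then commit to the
    empirically best coin for the remaining steps."""
    n = len(biases)
    heads = np.zeros(n)
    tosses = np.zeros(n)
    total_reward = 0

    # Exploration phase: sample every arm the same number of times.
    for i in range(n):
        for _ in range(explore_per_arm):
            r = rng.random() < biases[i]
            heads[i] += r
            tosses[i] += 1
            total_reward += r

    # Exploitation phase: commit to the arm with the best estimate.
    best = np.argmax(heads / tosses)
    remaining = horizon - n * explore_per_arm
    total_reward += rng.binomial(remaining, biases[best])
    return total_reward

print(explore_then_commit([0.4, 0.55, 0.7]))
```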

Before answering this, we’ll formalize the problem. Let the MAB have $n$ arms ($n=3$ coins in the above example), where the possible actions $A = \{ a_1, \cdots, a_n \}$ correspond to pulling the respective arms. We will keep the problem simple by using only Bernoulli rewards. The mean reward of $a_i$ is $p_i$ and $T$ is the total number of time steps (also called the horizon or budget). Let $a^*$ denote the optimal arm and $p^*$ be the corresponding mean reward.

We define the notion of regret to measure the optimality of an algorithm. The cumulative regret at time step $T$ is defined as $R(T) = \sum_{t=1}^T [p^* - \mathbb{E}(r_t)]$

where $r_t$ is the reward from the arm pulled at time $t$. Our objective will be to minimize regret when we evaluate the algorithms. The naive algorithm described above is not the best-known algorithm in terms of regret. In particular, it has a cumulative regret of $O(T)$ because there is a constant probability with which it won’t be sampling $a^*$ after the initial exploration. Intuitively, the best-known algorithms do the following two things:

1. They will never stop exploring and keep sampling all the arms with some probability to allow the optimal arm to eventually outperform the remaining arms in terms of probability estimates.
2. They will lower the number of times the non-optimal arms are sampled over time, with zero probability of sampling them in the limit.

Here are three such algorithms. I will not discuss them in detail here, but I will mention that they all achieve $O(\log T)$ regret (a sketch of UCB follows the list).

1. Upper Confidence Bound (UCB) algorithm: This algorithm chooses the arms based on an optimistic estimate of the average rewards, which is given by $u_i = \hat{p}_i + \sqrt{\frac{2 \ln(T)}{T_i}}$ where $\hat{p}_i$ is the MLE estimate and $T_i$ is the number of times $a_i$ has been sampled till time $T$.
2. Kullback-Leibler UCB (KL-UCB) algorithm: Very similar to UCB, except that the optimistic estimate $u_i$ is the largest value for which the KL-divergence between Bernoulli($\hat{p}_i$) and Bernoulli($u_i$) stays within a confidence bound.
3. Thompson Sampling algorithm: The reward estimate is obtained by sampling from a posterior distribution for each individual arm and the one with the highest reward estimate is pulled.
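As an example, here is a minimal sketch of the UCB algorithm for Bernoulli arms; the biases and horizon below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb(biases, horizon=10_000):
    """UCB1 for Bernoulli arms: pull each arm once, then always pull the
    arm with the highest optimistic estimate p_hat + sqrt(2 ln t / T_i)."""
    n = len(biases)
    pulls = np.zeros(n)
    rewards = np.zeros(n)

    for t in range(1, horizon + 1):
        if t <= n:                       # initialization: pull each arm once
            arm = t - 1
        else:
            p_hat = rewards / pulls
            bonus = np.sqrt(2 * np.log(t) / pulls)
            arm = np.argmax(p_hat + bonus)
        r = rng.random() < biases[arm]
        pulls[arm] += 1
        rewards[arm] += r

    # Pseudo-regret: how often suboptimal arms were pulled, weighted by their gap.
    return np.sum(pulls * (max(biases) - np.array(biases)))

print(ucb([0.4, 0.55, 0.7]))
```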

## Markov Decision Processes

An MDP is defined by its five parameters:

1. $S$, the set of states that the agent can be in
2. $A$, the set of actions that the agent can take
3. $T : S \times A \rightarrow \mathcal{P}_S$, the transition function, which maps each (state, action) pair to a probability distribution over the next state in $S$
4. $R : S \times A \times S \rightarrow \mathbb{R}$, the reward function defined for each (state, action, next state) triplet
5. $\gamma$, the discount factor

Note that $\mathcal{P}_S$ is the set of all probability distributions over $S$. The reward function can be deterministic as above, or stochastic, in which case $R$ gives the mean reward. Since the sum $\sum_{t=1}^T r_t$ can blow up as $T \rightarrow \infty$, we discount the future rewards by using the discounted cumulative reward $\sum_{t=1}^T \gamma^{t-1} r_t$ with $0 \le \gamma < 1$.

A deterministic policy followed by an agent is a mapping $\pi : S \rightarrow A$. If the agent is following a stochastic policy, the mapping will be into the space of probability distributions over $A$, i.e. $\pi : S \rightarrow \mathcal{P}_A$. If the agent is in state $s^t$ at time $t$, it will choose an action using $a^t \sim \pi(s^t)$, go to the next state decided by $s^{t+1} \sim T(s^t, a^t)$, and receive a reward given by $r^t \sim R(s^t, a^t, s^{t+1})$ (assuming the most general case of the policy, transitions and rewards all being stochastic).

When an agent follows a policy in an environment, the path it follows is called a trajectory, which is given by the sequence $\{s^0 a^0 r^0 s^1 a^1 r^1 \cdots \}$. A value function for a policy $\pi$, written as $V^\pi : S \rightarrow \mathbb{R}$, is a mapping from a state to the discounted cumulative reward that the agent will receive if it keeps following the policy $\pi$ from that state.
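A small sketch of how a trajectory is generated and its discounted return computed, using a toy MDP whose numbers are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-state, 2-action MDP (all numbers below are illustrative).
n_states, n_actions = 2, 2
T = np.array([[[0.9, 0.1], [0.2, 0.8]],        # T[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],        # R[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9
policy = np.array([0, 1])                      # deterministic policy pi[s] -> a

def rollout(T, R, policy, gamma, s0=0, steps=50):
    """Follow the policy from s0 and return the discounted cumulative reward."""
    s, ret = s0, 0.0
    for t in range(steps):
        a = policy[s]
        s_next = rng.choice(n_states, p=T[s, a])
        ret += gamma**t * R[s, a, s_next]
        s = s_next
    return ret

print(rollout(T, R, policy, gamma))
```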

A policy $\pi^*$ is an optimal policy if it satisfies

$\pi^* = \arg \max_\pi V^\pi(s) \quad \forall s \in S.$

But does every MDP have an optimal policy? I will not go into the proof here, but the answer is, wait for it, yes.

Mathematically, we define the value function as

$V^\pi (s) = \mathbb{E}_\pi \Big[ \sum_{t=0}^{\infty} \gamma^t r^t | s^0 = s, a^i = \pi (s^i) \forall i\Big].$

For a deterministic policy $\pi$, it is not hard to show that the above equation can be written recursively in terms of the next state $s'$ as

$V^\pi(s) = \sum_{s'\in S} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V^\pi(s')].$

This equation is called Bellman’s equation. If we know all the parameters of the MDP and the policy, we can write this as a set of $|S|$ linear equations which can be solved to get $V^\pi(s)$. This process is called policy evaluation.
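A minimal sketch of policy evaluation as a linear solve, reusing the toy MDP convention from the earlier sketch (all numbers are illustrative):

```python
import numpy as np

# Toy MDP (illustrative numbers), same shape conventions as before:
# T[s, a, s'] and R[s, a, s'].
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9
policy = np.array([0, 1])

def policy_evaluation(T, R, policy, gamma):
    """Solve the |S| linear Bellman equations (I - gamma * T_pi) V = R_pi."""
    S = T.shape[0]
    T_pi = T[np.arange(S), policy]                         # T_pi[s, s'] = T(s, pi(s), s')
    R_pi = np.sum(T_pi * R[np.arange(S), policy], axis=1)  # expected one-step reward
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

print(policy_evaluation(T, R, policy, gamma))
```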

Another important function that we come across often is the action value function, denoted by $Q^\pi : S \times A \rightarrow \mathbb{R}$, which is given by

$Q^\pi (s, a) = \sum_{s' \in S} T(s, a, s') [ R(s, a, s') + \gamma V^\pi(s') ].$

Note that $Q^\pi(s, \pi(s)) = V^\pi(s)$.

Since we know how to evaluate policies, we can now iterate over all policies and check which one satisfies the condition of optimality to find the optimal policy. But the main problem with this method is that the number of policies, $|A|^{|S|}$, is exponential in the number of states. There are better methods to find the optimal policy, which we shall discuss next. Imagine you knew that $\pi^*$ is the optimal policy, so you could calculate the action-value function $Q^*(s, a)$. To get the optimal action from here, you could use $\pi^*(s) = \arg \max_a Q^*(s,a)$. In fact, we can also write $V^*(s) = \max_a Q^*(s, a)$, which expands to

$V^*(s) = \max_a \sum_{s' \in S} T(s, a, s') [ R(s, a, s') + \gamma V^*(s') ].$

This set of $|S|$ equations is called the Bellman optimality equations. Solving for the optimal policy is a matter of solving these $|S|$ equations. There are three main algorithms to solve them, which I’ll briefly describe below.

### Value Iteration

Start from some initialization of the value function, $V_0$. Keep applying the update below until convergence.

$V_{t+1}(s) = \max_{a \in A} \sum_{s' \in S} T(s, a, s') [ R(s, a, s') + \gamma V_t(s') ]$

It can be shown using a contraction argument that $V_t$ converges to $V^*$ in the limit.
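Here is a minimal sketch of value iteration on the same toy MDP (numbers are illustrative):

```python
import numpy as np

# Toy MDP as before: T[s, a, s'], R[s, a, s'] (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9

def value_iteration(T, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality update until convergence."""
    S = T.shape[0]
    V = np.zeros(S)
    while True:
        Q = np.sum(T * (R + gamma * V), axis=2)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)        # V*, greedy policy
        V = V_new

V_star, pi_star = value_iteration(T, R, gamma)
print(V_star, pi_star)
```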

### Linear Programming

It can be shown that the Bellman optimality equations are equivalent to the following LP, which can be solved using any standard LP solver.

$\min \sum_{s \in S} V(s)$

$\text{ subject to } V(s) \geq \sum_{s' \in S} T(s, a, s') [ R(s, a, s') + \gamma V(s') ], \forall s \in S, \forall a \in A$.
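A sketch of this LP for the same toy MDP, using SciPy's `linprog` (the MDP numbers are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP as before: T[s, a, s'], R[s, a, s'] (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9
S, A = T.shape[0], T.shape[1]

# Minimize sum_s V(s) subject to, for every (s, a):
#   gamma * sum_{s'} T(s,a,s') V(s') - V(s) <= -sum_{s'} T(s,a,s') R(s,a,s')
c = np.ones(S)
A_ub, b_ub = [], []
for s in range(S):
    for a in range(A):
        row = gamma * T[s, a]
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-np.dot(T[s, a], R[s, a]))

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(None, None))
print(res.x)   # optimal value function V*
```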

### Policy Iteration

We initialize the algorithm with a random policy $\pi_0$. A state $s$ is called improvable under policy $\pi$ if $\exists a \in A, a \neq \pi(s)$ such that $Q^\pi(s, a) > Q^\pi(s, \pi(s))$. An action satisfying this condition is called an improved action for state $s$.

At each iteration, for each state $s$, we check if $s$ is improvable under the current policy $\pi_t$ and, if so, improve the policy by setting $\pi_{t+1}(s)$ to an improved action, until there are no improvable states. It can be shown that this algorithm strictly improves the value function at each step and converges to the optimal policy in a finite number of steps (since the number of policies is finite).
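A minimal sketch of policy iteration on the toy MDP, using the common variant that greedily switches every improvable state at once (numbers are illustrative):

```python
import numpy as np

# Toy MDP as before: T[s, a, s'], R[s, a, s'] (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.9

def policy_iteration(T, R, gamma):
    """Alternate policy evaluation (linear solve) and greedy improvement."""
    S = T.shape[0]
    pi = np.zeros(S, dtype=int)           # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi.
        T_pi = T[np.arange(S), pi]
        R_pi = np.sum(T_pi * R[np.arange(S), pi], axis=1)
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Policy improvement: switch every improvable state to a better action.
        Q = np.sum(T * (R + gamma * V), axis=2)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

print(policy_iteration(T, R, gamma))
```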

## Reinforcement Learning

In the discussion of MDPs, all the parameters of the MDP were conveniently assumed to be known. In reinforcement learning, however, we don’t have access to these parameters. One obvious way to get around this problem is to estimate the parameters by gathering experience in the environment. We can maintain two 3-dimensional matrices: $\hat{T}$ for the estimated state transition probabilities and $\hat{R}$ for the estimated mean rewards.

$\hat{T}(s, a, s') = \frac{\sum_t \mathbb{1}_{s^t=s,\, a^t=a,\, s^{t+1}=s'}}{\sum_t \mathbb{1}_{s^t=s,\, a^t=a}}, \qquad \hat{R}(s, a, s') = \frac{\sum_t r^t\, \mathbb{1}_{s^t=s,\, a^t=a,\, s^{t+1}=s'}}{\sum_t \mathbb{1}_{s^t=s,\, a^t=a,\, s^{t+1}=s'}}$
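A sketch of these empirical estimates, assuming the experience is available as a list of $(s, a, r, s')$ tuples (the transitions below are made up):

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate T_hat and R_hat from a list of (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a, s_next] += r

    # Divide only where counts are nonzero to avoid 0/0.
    sa_counts = counts.sum(axis=2, keepdims=True)
    T_hat = np.divide(counts, sa_counts, out=np.zeros_like(counts),
                      where=sa_counts > 0)
    R_hat = np.divide(reward_sums, counts, out=np.zeros_like(reward_sums),
                      where=counts > 0)
    return T_hat, R_hat

# Example with a few made-up transitions.
T_hat, R_hat = estimate_model([(0, 1, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 0.5, 1)],
                              n_states=2, n_actions=2)
print(T_hat[0, 1], R_hat[0, 1])
```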

Such algorithms, which keep an estimate of the model with space complexity $O(|S|^2 |A|)$, are called model-based algorithms. In contrast, model-free algorithms have lower space complexity.