Batch Arm Pulls for Stochastic Multi-Armed Bandits

7 minute read


I took the course Advanced Concentration Inequalities in Autumn 2019 offered by Prof. Sharayu Moharir at IIT Bombay. This is a blog post about the work I did for the course project with Shashwat Shukla and Sucheta Ravikanti on designing regret minimization algorithms for the problem of batch pulls in stochastic multi-armed bandits.

In the batch setting, multiple arms are pulled at the same time, and the rewards from these pulls are returned together at the end of the round. Thus, we must jointly pick $K$ arms to explore and exploit in this setting, and it was this richer setting that was of interest to us. Applications of this model include parallel computing, advertisement models, mining exploration, etc. We designed algorithms both for the setting of minimizing simple regret (which corresponds to pure exploration) as well as for cumulative regret (which balances exploration and exploitation).

For the case of simple regret, we proposed an algorithm that perturbs the current estimates of the mean arm rewards and then picks the top $K$ perturbed arms in each round. Our algorithm has a lower sample complexity than the previous state-of-the-art algorithm by approximately a factor of $N/K$, where $N$ is the total number of arms, while still returning an answer after the same number of rounds. This is a massive reduction in sample complexity (say $N=1000$ and $K=10$), and we verified the correctness of our algorithm via extensive simulations. Our proposed algorithm is essentially a randomized form of UCB and it is thus very interesting to note that a UCB-like algorithm is also working so well for the pure exploration problem of reporting the top-K arms. The beautiful understanding that emerges from our study is that in a finite time setting, this UCB-like algorithm balances safe exploration with ambitious exploration, and is thus a useful and very different viewpoint from the typical exploration-exploitation tradeoff. Due to the limited time-horizon available, this tradeoff is beneficial and allows us to significantly outperform previously proposed uniform sampling methods. Thus our algorithm is strongly reminiscent of the lil’UCB algorithm proposed for the $K=1$ setting.

We also designed novel algorithms for minimizing cumulative regret in the setting of batch pulls for Gaussian Process Multi-Armed Bandits. We were interested in this setting because with prior information about correlated values of ‘nearby’ arms, finding the optimal set of arms to pull is a richer problem. Gaussian processes allow us to formalize our prior beliefs about these correlations between mean rewards values of arms while also allowing us to compute the posterior to make principled online decisions. We designed batch versions of GP-UCB that go beyond the naive choice of confidence bounds for the rewards from $K$-arm subsets. We did this by using the simple observation that the variance of the sum of correlated random variables, mean arm rewards in this case, is not equal to the sum of their variances. Another interesting observation we made is that the update equation for the covariance matrix of the Gaussian Process only uses the locations of sampled arms. We used this to design a new algorithm that uses ‘virtual’ arm pulls within a batch to pick the $K$ arms in a way that is better at balancing exploration and exploitation in this interesting setting. We also compared these batch GP-UCB algorithms to our proposed batch GP-TS algorithm.

The slides from our project presentation can be found here.

Leave a Comment