An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits

arXiv:2412.02861v1 Announce Type: new
Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems, where the agent receives binary rewards with probabilities determined by a logistic function $exp(beta langle a, theta rangle)/(1+exp(beta langle a, theta rangle))$. We focus on the setting where the action $a$ and parameter $theta$ lie within the $d$-dimensional unit ball with the action space encompassing the parameter space. Adopting the information-theoretic framework introduced by (Russo $&$ Van Roy, 2015), we analyze the information ratio, which is defined as the ratio of the expected squared difference between the optimal and actual rewards to the mutual information between the optimal action and the reward. Improving upon previous results, we establish that the information ratio is bounded by $tfrac{9}{2}d$. Notably, we obtain a regret bound in $O(dsqrt{T log(beta T/d)})$ that depends only logarithmically on the parameter $beta$.

Amaury Gouverneur, Borja Rodr’iguez-G’alvez, Tobias J. Oechtering, Mikael Skoglund

Go to original source