Multi-armed bandits with probing multi-armed

Multi-armed bandits is a sequential decision-making problem where an agent must choose between multiple actions to maximize its cumulative reward over time, while facing uncertainty about the rewards associated with each action. The challenge lies in balancing the exploration of potentially higher-rewarding actions with the exploitation of known high-reward actions. We consider a multi-armed bandit problem with probes, where before pulling an arm, the decision-maker is allowed to probe one of the K arms for a cost $c\geq 0$ to observe its reward. We introduce a new regret definition that is based on the expected reward of the optimal action. We develop UCBP, a novel algorithm that utilizes this strategy to achieve a gap-independent regret upper bound that scales with the number of rounds T as $O(\sqrt {KT\log T})$, and an order optimal gap-dependent upper bound of $O(K\log T)$. As a baseline, we introduce UCB-naive-probe, a naive UCB-based approach which has a gap-independent regret upper bound of $O(\sqrt {KT\log T})$, and gap-dependent regret bound of $O(K^{2}\log T)$; and TSP, the Thompson sampling version of UCBP. In empirical simulations, UCBP outperforms UCB-naive-probe, and performs similarly to TSP, verifying the utility of UCBP and TSP algorithms in practical settings.

Conference Name: 2024 IEEE International Symposium on Information Theory, ISIT 2024

Date of Conference: 07-12 July 2024

Country

Turkey

Related Organizations

Bilkent University
Turkey

Keywords

Information theory, Multi-armed bandit problem, Probes, Upper bound, Costs

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Found an issue? Give us feedback

0

Average

Upload OA version

Are you the author of this publication? Upload your Open Access version to Zenodo!

It’s fast and easy, just two clicks!

uploadUpload now