Perturbation-Based Confidence Regions and Their Application to Stochastic Bandits

Constructing confidence intervals via data perturbation can be used as a system identification approach to estimate the parameter of linear regression models. Such a method is called Sign-perturbed Sums (SPS), which is a novel, nonparametric solution that relies only on mild statistical assumptions and provides non-asymptotic confidence set that contains the unknown system parameter with predetermined probability. The essence of this thesis is the exhaustive analysis and machine learning application of the SPS method. In the first chapter of the thesis, we give a brief insight into stochastic multi-armed bandit (MAB) problems, which are sequential decision making tasks in random environments. MAB problems illustrate the core dilemma of reinforcement learning (which is one of the three branches of machine learning), the exploration vs. exploitation dilemma. We give an overview of the key concepts of bandits and a state of the art approach, called Upper Confidence Bound policy, which is proven to be an efficient (but typically parametric) solution to tackle such stochastic sequential allocation problems. In the second part, we get acquainted with Sign-perturbed Sums (SPS) method, and we present several known properties of SPS confidence regions regarding their shape and size. Furthermore, as our contribution to the existing theory of SPS, we examine the cases of degenerate confidence sets and estimate the probability of their occurrence. We also present a generalization of the SPS method, which allows not only sign but also symmetrical data perturbation, and prove the exact confidence level of the regions provided by symmetrically-perturbed sums method. In addition, we consider further modifications that do not affect the inclusion of the system parameter with the user-chosen confidence probability. In general, the determination of SPS confidence regions in a compact representation is a computationally intensive task. However, in one dimension, the exact confidence region can be calculated efficiently in closed form. In the final, third chapter of the thesis, we examine the one-dimensional SPS method to gain deeper understanding of the construction. Using the alterations proved in the second chapter, we introduce the linear SPS method, which provides semi-infinite confidence intervals. Lastly, we propose a nonparametric, perturbation-based stochastic multi armed bandit algorithm (which relies on linear SPS). To empirically compare it with existing UCB policy, we implemented in Python all the algorithms to be examined in the thesis and ran various simulations.

Keywords

stochastic bandits, confidence regions, randomized algorithms, sign-perturbed sums

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average