Inverse Bayesian Optimization: Learning Human Acquisition Functions in an Exploration vs Exploitation Search Task

This paper introduces a probabilistic framework to estimate parameters of an acquisition function given observed human behavior that can be modeled as a collection of sample paths from a Bayesian optimization procedure. The methodology involves defining a likelihood on observed human behavior from an optimization task, where the likelihood is parameterized by a Bayesian optimization subroutine governed by an unknown acquisition function. This structure enables us to make inference on a subject's acquisition function while allowing their behavior to deviate around the solution to the Bayesian optimization subroutine. To test our methods, we designed a sequential optimization task which forced subjects to balance exploration and exploitation in search of an invisible target location. Applying our proposed methods to the resulting data, we find that many subjects tend to exhibit exploration preferences beyond that of standard acquisition functions to capture. Guided by the model discrepancies, we augment the candidate acquisition functions to yield a superior fit to the human behavior in this task.


Introduction
The problem we study in this paper is a type of supervised learning problem: given observed human behavior from a sequential optimization task, we aim to create a model that accurately describes how they will behave on future rounds of the task. However, unlike most supervised learning problems, we explicitly assume that the observed behavior in each round of the task represents an approximate solution to a latent optimization problem faced by the subject. Not only do we want to incorporate this optimization subroutine into our model, but inferring the criteria governing the subject's optimization strategy is actually our primary interest. This type of analysis can be termed an inverse optimization.
A notable gap within the inverse optimization literature is the lack of statistical treat-ment of its methodology. Only very recently have statistical concepts such as consistency (Aswani et al., 2018), goodness of fit (Chan et al., 2019), and robustness (Ghobadi et al., 2018;Esfahani et al., 2018;Shahmoradi and Lee, 2019) been treated in methodological papers. This is surprising because many optimization problems are assumed to be carried out imperfectly, thereby introducing noise to the data. In such cases, inverse optimization becomes an inference problem well-suited to probabilistic modeling. A key contribution of our work is the formulation of an inverse decision problem via a probabilistic model, which provides a principled framework to quantify both the variability with which the subject performs the optimization as well as the uncertainty around the parameter estimates governing the underlying optimization strategy.
We illustrate our methods by analyzing human decision-making behavior from a sequential optimization task developed in collaboration with the Neuromotor Control Lab at Harvard University. As will be further explained in Section 2, the task was specifically designed to present subjects with a conflict between exploration and exploitation on the second move of the task. Our goal is to make inference on how the subjects strategically balance exploration and exploitation based on their behavior on this second move.
In order to make inference on the subjects' strategies, we require a model for their optimization procedure. As will be detailed in Section 3, our model framework draws on Bayesian optimization, which is a sequential model-based approach to maximize (or minimize) an unknown objective function (Jones et al., 1998;Shahriari et al., 2015). 1 The core idea of Bayesian optimization is to characterize the uncertainty about an unknown objective function with a statistical model, termed a surrogate, then to strategically synthesize the model uncertainty via an acquisition function in order to determine promising new locations to sample. We adopt this terminology in our paper, as the fundamental components of our optimization model follow those of Bayesian optimization.
Within this context, our goal becomes to characterize and estimate subjects' acquisition functions (i.e. their exploration/exploitation preferences) based on their observed behavior. The novelty of our method is that we define a likelihood on a subject's observed search paths parameterized by an underlying Bayesian optimization subroutine governed by an acquisition function. This structure enables us to make inference on the subject's acquisition function while allowing their behavior to deviate around the solutions to the underlying optimization subroutine. By using probabilistic models of this form, both the variability with which subjects optimize as well as the uncertainty around their estimated exploration/exploitation preferences can be quantified.
We find that a wide range of acquisition functions are exhibited across the subjects in our study, but probability of improvement (PI) and upper confidence bound (UCB) functions tend to fit the majority of subjects best. Unlike previous studies of this problem, we are able to provide credible intervals to characterize the uncertainty in the estimated acquisition functions. Finally, there are many subjects for which none of the candidate acquisition functions we initially consider provide a satisfactory model of their behavior. Guided by the model discrepancies for these subjects, we propose an augmentation to the acquisition functions to construct a superior model of human optimization behavior in this task.

Related work and contributions
A number of recent papers investigate the correspondence between ML algorithms and actual human behavior for various decision-making processes (Borji and Itti, 2013;Wilson et al., 2015;Schulz et al., 2015;Wu et al., 2018;Plonsky et al., 2019;Gershman, 2019;Candelieri et al., 2020). Within this context, Borji and Itti (2013) compare several ML techniques against observed human behavior in a set of sequential optimization tasks that require subjects to make choices along one continuous dimension, finding that Bayesian optimization algorithms best approximate the human behavior. Recent papers expand their work to tasks that require discrete choices (Wu et al., 2018) and tasks where the decision space is continuous in two dimensions (Candelieri et al., 2020). These studies essentially explore the inverse problem of Bayesian optimization. Our research builds on these studies by introducing a probabilistic framework for the inverse problem, which we introduce in Section 4.
More broadly, our work contributes to the literature on inverse optimization (Ahuja and Orlin, 2001) and related problems such as inverse reinforcement learning (Ng et al., 2000) and inverse decision theory (Swartz et al., 2006). Li (2020) is particularly relevant to our project, which uses inverse optimization to learn a convex function representative of a subject's risk preferences. The estimand in our project-the acquisition function of a Bayesian optimization-is analogous to the risk preferences estimated in Li's paper.

Outline
The rest of this paper is outlined as follows. In Section 2, we describe the search task and explain how it forces subjects to make an exploration vs exploitation trade-off on the second move of the task. In Section 3, we frame the search task as a Bayesian optimization procedure and show how the forward optimization proceeds under three acquisition function classes. Section 4 introduces the inverse problem and illustrates our proposed solution framework based on the second move of the search task. We also augment the acquisition functions to enable a more accurate fit to the human tendencies we observe. In Section 5 we present our results and explore relationships between the estimated acquisition functions and subjects' corresponding performance in the task. Section 6 summarizes our contributions and suggests directions for future work.
All of our data and code to reproduce our results is hosted publicly at the first author's GitHub page: https://github.com/nsandholtz/hotspot_paper.

Search Task Description
In collaboration with the Neuromotor Control Lab at Harvard University, we designed a "hotspot" search task to present subjects with exploration vs exploitation conflicts while searching for an unknown optimal location. On each round, subjects searched for an invisible target location (a "hotspot") randomly placed on a 9-inch diameter circular task-region shown to them on a computer monitor. Rounds consisted of 3 to 10 moves, where the number of moves was determined randomly according to a uniform distribution and the number was unbeknownst to the subject until the end of the round.
Within a round, a "move" entails sampling a location on the task region (i.e. clicking on it) after which a numerical score is immediately displayed to the user proportional to the click location's proximity to the hotspot. Specifically, the reward received on move t is a deterministic function of the distance from the click location to the hotspot: where r 0 denotes the initial score, d(m t ) denotes the euclidean distance (in pixels) of move m t (a two-dimensional coordinate) to the hotspot, and k denotes the reward scale (i.e. the reward's sensitivity to hotspot distance). The reward scale k is randomly generated for each new round of the task according to a Uniform[0, 5/3] points/pixel distribution and is opaque to the subject. This prevents subjects from gaining information about the objective function along the orthogonal direction of their first move, thus ensuring that the second move presents a tradeoff between exploration and exploitation. 2 To minimize effects of the task-region boundary guiding subject search behavior, the task always began at the center of the task region. We refer to this initial starting point as "move 0". Each subsequent move was constrained to be within a small circular region (0.2 inch radius) around the previous move. Subjects performed many rounds of this task; across the 28 subjects who participated, the minimum number of rounds played was 228 and the maximum was 716. Panel (a) of Figure 1 displays an example round of the experiment.
We are primarily interested in the second move (move 2) of the task because this move provides the most information about a subject's exploration vs. exploitation preferences. The first move (move 1) a subject makes is virtually random since they have no information about the direction of the hotspot upon beginning each round, as illustrated in panel (b) of Figure 1. After receiving feedback from their first move, the subject acquires information about the gradient of the objective function but only along a single direction. This poses a conflict between exploration and exploitation on the second move: continuing along the direction of their first move represents pure exploitation of their current knowledge, while any deviation from this direction represents some degree of exploration. Thus the combination of a subject's first and second moves provides direct insight into how they balance exploration and exploitation, as illustrated in panel (c) of Figure 1.  Figure 1: (a): An example round of the hotspot task. The red target shows the invisible hotspot location and the dots track the subject's search path. Each move's score is shown in a text box. The score at the hotspot is the score the subject would be given if they sampled the hotspot. (b): Initial state (move 0) for the example round shown in (a), zoomed in to the move 1 click-region. Moving in any direction represents exploration, as shown by the dashed green lines. (c): The subject moves to the upper edge of the move 1 click-region, receiving a score of 98. This creates a conflict between exploration (green) and exploitation (blue) on move 2. Moving perpendicular to direction of move 1 represents pure exploration (move 2a), while moving along the same direction of move 1 represents pure exploitation (move 2b). Any move between these extremes represents a combination of exploration and exploitation (move 2c).
In the next section we frame this task as a Bayesian optimization procedure. Ultimately, our goal is to characterize and estimate subjects' exploration/exploitation preferences via the acquisition function of the Bayesian optimization model of their behavior on the first and second moves of the task.

Optimization Framework
Mathematically, the subjects' goal is to maximize an unknown objective function f where m is a two-dimensional location and M is the circular task region defined on R 2 . However, as each move is constrained to be within a small region about the location of their previous move, the problem can be more precisely characterized as where M i is the local click-region surrounding m i−1 as described in Section 2, and k is randomly selected from the uniform distribution on the integers from 3 to 10.
In order to make inference on how the subject's solve (3.2), we model their optimization procedure as a Bayesian optimization (Borji and Itti, 2013). This entails characterizing the uncertainty about the unknown objective function with a surrogate, then strategically synthesizing the surrogate uncertainty via an acquisition function in order to determine promising new locations to sample. We define this framework in the context of the first and second moves of the task, as we exclusively consider these moves in the inverse problem. Appendix A in the supplemental material to this paper explains why moves 3 and up provide comparatively little information about the subjects' acquisition preferences.

Choosing a surrogate
A 'surrogate' is simply a term for a statistical model with an emphasis on pragmatism and prediction rather than interpretability and identification (Gramacy, 2020). Bayesian optimization applications typically use Gaussian process (GP) surrogates, which is a highly flexible class of models that can easily be updated as new samples from the objective are obtained.
In our case, GP surrogates would inaccurately specify the subjects' beliefs about the objective given their prior knowledge of the reward structure, and would therefore invalidate any inference about their acquisition function. Subjects know a priori that there is an optimum somewhere on the circular task-region and that the rewards decrease uniformly and monotonically from this hotspot with distance. This reward structure is illustrated in Figure 2, which shows the objective function from the example round shown previously.
As illustrated in the left plot, globally the objective function is conical. However, the experiments were designed to yield a surface that is approximately linear in the local region around each individual move, as shown in the right panel. Given the subjects' prior awareness of the reward structure, in addition to the approximate linearity in the neighborhood defined by a move's click-region, we assume that the subjects use a linear model as a surrogate of the objective: where m 1 = (x 1 , y 1 ), β = β x β y represents the reward gradient with respect to the Cartesian plane and 1 represents a deviation from the surrogate to the true objective, which we assume to be Gaussian. We fix σ s at a tiny value (σ s = 0.01) because the deviations from the linear model to the objective are negligible in the local region about m 1 . 3 The approximate linearity of the objective function in the click-region around each move is an important feature of the experiment, as it is what allows us to assume that the directions of exploration and exploitation are orthogonal on move 2. We note, however, that the fidelity of the linear approximation to f declines as the distance to the hotspot decreases. This fact required a careful design of the experiment; we wanted the majority of clicks to take place far from the hotspot to make the linear approximation more accurate, but we also wanted the hotspot to be reachable so that subjects performed the optimization seriously. Balancing these competing desires, we selected the number of moves per round and the size of the local click-region such that it would be impossible to cover the distance to the hotspot in approximately 90% of all rounds. We omitted rounds where the hotspot was reachable in our analysis.

Updating the surrogate
As subjects sample new locations and receive additional feedback, their uncertainty about the objective function diminishes. In the context of Bayesian optimization, this corresponds to updating the surrogate as new information is gained about the objective function.
We apply a Bayesian framework for the surrogate inference. Since the hotspot is randomly placed over the task-region, we assume an isotropic, zero-mean Gaussian prior for β: where σ 2 β is a hyperparameter that we select to give a weakly-informative prior on the mean. Let the observed search path after move 1 be denoted as D 1 = {M 1 , r 1 }, where M 1 is a 2 × 2 matrix with the first row equal to (0,0) and the second row equal to m 1 , and r 1 is a lenth-2 column vector of observed rewards on moves 0 and 1. Given D 1 , (3.4) can be updated yielding a posterior distribution on β. The prior is conjugate and the posterior can be computed analytically (Taboga, 2021, Online appendix): The posterior predictive distribution of the surrogate for move 2 is given by: where m 2 is a potential move 2 location and all other terms are as defined previously (Taboga, 2021, Online appendix). Figure 3 shows the updated surrogate predictive distribution for an example round after move 1.

Choosing an acquisition function
The updated surrogate can be utilized to obtain promising new locations to sample. This process is operationalized through the acquisition function. The acquisition function u is a function of a proposed location m 2 and the surrogate after move 1 (which itself is a function of D 1 ). Acquisition functions are typically constructed such that high values of the function correspond to potentially high values of the objective, either because the predicted mean is high, the uncertainty is high, or both (Brochu et al., 2010). The experimenter maximizes this function over M 2 , the set of potential move 2 locations, thus obtaining a new location to sample: The left and right plots show the posterior predictive mean and standard deviation surfaces respectively, projected onto a fine grid over the move 2 click-region.
A host of acquisition functions have been proposed in the literature (see Shahriari et al. (2015) for a review). While any acquisition function that selects new locations deterministically is a viable candidate for the methods we introduce in this paper, we restrict our analysis to probability of improvement (PI), expected improvement (EI), and an upper confidence bound (UCB) criterion since these functions are well-known among Bayesian optimization practitioners and have analytic solutions under Gaussian models (Jones et al., 1998;Srinivas et al., 2009). Mathematically, these are defined by where p > 0.5, m + 1 is the best location observed so far over the two existing samples (i.e. m + 1 = argmax m∈M1 f (m)), andF 1 is the cumulative distribution function off 1 . Each of the acquisition functions in (3.10)-(3.12) depend on an additional parameter which controls the premium on exploration. Acquiring via PI (Kushner, 1964) results in the location that most confidently predicts an increase in the response but ignores improvements less than ξ PI . EI (Mockus et al., 1978) considers the magnitude of improvement at a particular location, where ξ EI controls a tradeoff between exploration (high ξ EI ) and exploitation (low ξ EI ) (Lizotte, 2008). UCB acquisition functions (Cox and John, 1992;Srinivas et al., 2009) select new locations to sample based on optimistic estimates of the resulting reward at each location. Higher values of p encourage more exploration. 4 Figure 4 shows the acquisition surfaces for the PI, EI, and UCB functions defined in (3.10)-(3.12) in the context of the example round shown in Figure 1. We reoriented the data so that move 1 falls along the horizontal-axis for visual clarity. The arg max(s) of each surface is denoted by a green star.  Notice that each method prescribes different move 2 locations; PI recommends pure exploitation, while EI and UCB recommend almost pure exploration. Of course, this is not always the case. Depending on the change in score from move 0 to move 1, in addition to the exploration parameter values (ξ PI , ξ EI , and p), the optimal locations returned by the acquisition functions can vary substantially. Also note that each surface is symmetric across the exploitation axis (i.e. y = 0). This is a feature of the subject having gained information exclusively about a single direction after the first move. Due to this phenomenon, the surfaces may be bimodal or unimodal.
Because we use a linear model as the surrogate for the objective function, the optimal location to sample on move 2 (as determined by (3.9)) always falls on the click-region boundary, regardless of the change in score on move 1. This harmonizes with the sub-jects' actual behavior; in 95% of the rounds, subjects moved to the click-region boundary on move 2. 5 This effectively allows us to reduce the arg max search from two dimensions to one dimension. Leveraging polar coordinates, we can fix the radius of the move 2 location at the click-region boundary and limit the arg max search solely to θ 2 , the angle of move 2 relative to the direction of move 1: By reducing the dimensionality of the optimization we can illustrate acquisition values over the entire range of ∆r 1 = r 1 − r 0 , the change in score after a subject's first move, as shown in Figure 5. The three plots show acquisition values over a fine grid of (∆r 1 , θ 2 ) pairs for the same acquisition functions shown in Figure 4. In each plot, the green curves denote the angles that yield the maximum of the acquisition values as a function of ∆r 1 . Note that the move 2 locations are assumed to be made on the click-region boundary. The horizontal axis denotes ∆r 1 , the change in reward from m 0 to m 1 . The vertical axis shows θ 2 , the angle of the second move relative to the first move. Color indicates the acquisition function value for any given (∆r 1 , θ 2 ) pair. In each plot, the green curve denotes the angle that yields the maximum of the boundary acquisition values as a function of ∆r 1 . The vertical black lines at ∆r 1 = 5 correspond to the circular black lines denoting the click-region boundaries in Figure 4. Similarly, the green stars correspond to the stars in Figure 4.
Going forward, we will refer to an acquisition curve as the curve defined by the optimal θ 2 as a function of ∆r 1 for a given acquisition function. Figure 6 shows various acquisition curves from the PI, EI, and UCB acquisition families. Depending on the acquisition family and exploration parameter, the acquisition curves can vary substantially.  Our goal is not to determine an optimal strategy, or acquisition function, for this task. Rather, using Bayesian optimization as a model of each subject's optimization procedure, we aim to learn subject-specific acquisition functions guided by their observed behavior in the experiment.
Before proceeding to Section 4, which formally introduces the inverse problem, we pause to comment on the unconventional nature of the Bayesian optimization framework we have developed thus far. The reward structure and transparent experimental setup (which allows subjects to understand the reward surface in relatively few moves) were carefully designed to allow us to assume that subjects' beliefs about the reward surface can be approximated by a linear surrogate. This is crucial in enabling us to decouple uncertainty about a subject's surrogate from uncertainty about their acquisition function, thus allowing us to make plausible inference on their acquisition function.

Learning Human Acquisition Functions
In this section we propose a method to estimate a subject's unknown acquisition function u given their observed decisions on move 2 assuming that these decisions are approximately optimal solutions to (3.9) with respect to u. We say 'approximately optimal' because the subjects' decisions exhibit considerable variability, as illustrated in Figure 7 which shows the move 2 behavior for three subjects in our study as a function of ∆r 1 . 6

Subject 21
Subject 24  This variability could arise from a number of factors. For one, the modeling described in Section 3 happens subconsciously; subjects "update" their uncertainty about the objective function intuitively, which introduces human error. Noise could also be due to variation in a subject's ability to click exactly where they intend to, changes to their sampling strategy, or sloppiness from performing the task rapidly.
Conceptually, solving (4.1) can be understood as finding acquisition curves (see Figure 6) that best fit the humans' observed optimization behavior (see Figure 7). As such, the problem becomes one of inference rather than optimization. The optimization has already occurred-we want to infer how each subject optimized.

A probabilistic solution framework
We assume subjects' optimization behavior is characterized by an average strategy but that their behavior deviates randomly around this strategy from round to round. Under this assumption, we propose a probabilistic solution framework for (4.1) as follows: 1. Assume a likelihood for (θ i 2 |∆r i 1 ,f 1 , u), parameterized such that the mode of the distribution equals the arg max of the acquisition function u.
2. Select a set of candidate acquisition functions U as potential characterizations of the optimizer's preferred strategy and set a prior distribution over U.
3. Compute the posterior probability of each candidate acquisition function u ∈ U: where (·|·) is the likelihood from step 1, and p(·) is the prior from step 2.
We employ a Bayesian framework in our approach, but the procedure could be carried out using other estimators (e.g. maximum likelihood). We now explain how we implement each of these steps in context of the hotspot search task.

Choosing the likelihood
To model the symmetric bimodal nature of our data, we assume θ i 2 follows a twocomponent wrapped Cauchy mixture distribution: 7 . (4.6) The location of each mixture component in (4.4) is constrained to be at the argmax(s) defined by arg max θ∈(−π,π] u(θ|∆r i 1 ,f 1 ). The parameter γ measures a subject's variation about their acquisition curve and the mixture weight w governs a subject's propensity to favor exploring to the right over exploring to the left. We use the wrapped Cauchy distribution because it is more robust to outliers, which are fairly common across many subject's move 2 behavior.
Together, (4.4)-(4.6) yield a likelihood that adheres to the defining features of our data. It is not only defined on the proper support for θ 2 (i.e. (−π, π]), it is also guaranteed to be symmetric over the axis of exploitation (i.e. θ 2 = 0); since the argmaxes defined in the denominators of (4.5) and (4.6) will always be reflected about θ = 0, the corresponding mixture distribution will also be symmetric about θ = 0.
The fact that we parameterize the likelihood such that the mode equals the arg max of the acquisition function is what renders this an 'inverse optimization' method. The novelty is that we couch this restriction within a likelihood, thereby allowing subject behavior to deviate around an 'optimal' strategy (with respect to a given acquisition function u) according to the specified probability distribution. This combination distinguishes our approach from both from other probabilistic models and other inverse optimization methods.

Setting priors
We select the PI, EI, and UCB acquisition families defined in (3.10)-(3.12) as the set of candidate acquisition functions U. Note that our method does not preclude using other acquisition families-any acquisition function can be included in the candidate set U. Each acquisition function u ∈ U is assumed to have equal prior probability, thus we place a uniform prior over the parameters indexing the acquisition function space: ξ PI , ξ EI , p. We also put a uniform prior distribution over γ, the scale parameter governing the variability in each subject's optimization behavior. Appendix B in the supplemental material to this paper contains additional details on prior specification.

Approximating the posterior
The existence of the arg max in the parameterization of (4.4)-(4.6) creates an additional optimization step when evaluating (4.3). To ease the computational burden posed by this additional optimization subroutine, we precompute a fine grid of acquisition curves over a plausible range of values for each candidate family and restrict our inference to this discrete set. We also precompute the maximum likelihood estimate of the mixture weight w using the EM algorithm (Dempster et al., 1977). We can then compute (4.3) for each u ∈ U by multiplying its prior probability by the joint likelihood of θ 2 (plugging in the precomputedŵ) and the prior over γ. Figure 8 shows the (∆r 1 , θ 2 ) pairs for subjects 21, 24, and 28 overlaid with their corresponding maximum a posteriori (MAP) acquisition curves in color. The light-gray shaded areas show approximate 95% highest posterior density (HPD) prediction regions, while the dark-gray regions denote 95% HPD credible regions on the acquisition curves. In each plot, text in the lower-right corner shows the estimated exploration parameter for the subject's MAP acquisition function. Around each curve is an approximate 95% HPD credible region on the acquisition curve (dark-gray) and 95% HPD prediction region (light-gray).

Incorporating human tendencies
On visual inspection of Figure 8, the MAP acquisition curves for subjects 21 and 24 fit their data fairly well but best fit curve for subject 28 fits poorly. This is because there are no candidates in U that prescribe such a drastic change in strategy from negative ∆r 1 to positive ∆r 1 values while maintaining highly exploratory preferences. The set of acquisition functions as currently defined cannot yield acquisition curves with this shape. Consequently, there is no parameter setting of ξ PI , ξ EI , or p which renders subject 28's average behavior as optimal.
In order to allow this type of behavior to be rendered optimal, we propose augmenting the acquisition functions by an additional parameter τ . For a given acquisition function u, we define augmented acquisition function u as where τ ∈ [0, π/2]. Equation (4.7) defines the augmented value to be the minimum acquisition value (w.r.t. θ 2 and ∆r 1 ) for all move 2 angles that aren't at least τ -radians exploratory, either in the forward or backward directions. If a given move 2 angle is greater than τ and less then π − τ (in absolute value), the move is "sufficiently exploratory" and the acquisition value is unchanged. Move 2 Boundary UCB Figure 9: Click-region boundary augmented acquisition values over the range of possible ∆r 1 values for three sample acquisition functions. The solid green curves denote the angles that yield the maximum of u(θ 2 |∆r 1 ,f 1 , τ ), while the dashed green curves show the angles that yield the maximum of u without being augmented by τ (i.e. arg max u(θ 2 |∆r 1 ,f 1 )).
in Figure 5.
Another feature we account for is the human tendency to react differently for positive vs. negative feedback (Tversky and Kahneman, 1979). Some subjects exhibit different exploration tendencies depending on whether they get a negative or positive change in score on their first move. In order to allow for this type of behavior we modify (4.7) to allow different values of τ depending on whether ∆r 1 is positive or negative: (u(θ 2 |∆r 1 ,f 1 )) otherwise. (4.8) The practical effect of (4.7)-(4.8) is that they allow the optimization criteria to be based solely on exploration. While the parameters in the unmodified PI, EI, and UCB acquisition families allow the optimizer to balance exploration vs. exploitation differently when synthesizing the uncertainty inf , they do not enable the optimizer to let exploration completely dominate exploitation. The augmented acquisition functions allow exploration to trump exploitation, regardless off 1 . We observe this type of behavior by many subjects in our study. Figure 10 shows the data for the same subjects as before, this time overlaid with the fitted models using the augmented acquisition function in (4.8). Visually, the fit appears to be superior. Each scatterplot is overlaid with the subject's corresponding MAP augmented acquisition curve in color. Green denotes PI and red denotes UCB. Around each curve is an approximate 95% HPD credible interval on the acquisition curve (dark gray) and 95% HPD prediction interval (light gray). Point estimates for the parameters governing each subject's acquisition function are listed in the lower right corner.

Model validation
We fit the acquisition models in (3.9), (4.7), and (4.8) to each subject's data following the procedure outlined in Section 4.1. For the augmented models we estimate the τ parameters as well. As with the other exploration parameters, we set discrete uniform priors on τ , τ + , and τ − . Table 1 shows out-of-sample log-likelihoods (using an 80%/20% train/test split of the data) for each model across the subjects in our study.
The u ± model provides the best fit on aggregate, though there are some subjects for which the additional flexibility offered by the τ parameters does not substantially improve the fit. Unless otherwise specified, from this point onward all model references utilize the augmented acquisition function in (4.8).
We also have included a simulation study in Appendix C of the supplemental material to validate our methodology in a controlled setting. The study demonstrates that our method accurately recovers the true underlying acquisition function under different acquisition function parameters and observation sample sizes. It also explores how our method performs when the true acquisition function class is omitted from the inference acquisition function class.  -2036.46 -1843.54 -1709.19 5 Results Table 2 shows each subject's MAP estimates of the u ± parameters from (4.8) in columns 1-3. Columns 4 and 5 show their estimates of γ and w, respectively. Figure 14 in Appendix E of the supplemental material plots the data, MAP augmented acquisition curves, and 95% HPD intervals for all 28 subjects.
One observation that immediately jumps out from the table is that the vast majority of subjects are best represented by the UCB acquisition function with p = 0.5. 8 Before augmenting the acquisition function, this would imply that most subjects are purely exploitative on their second move, however, after augmenting the acquisition function this interpretation no longer holds. Depending on a subject's estimated values of τ − and τ + , subjects can exhibit highly exploratory behavior despite havingp = 0.5. Under the augmented acquisition function, the parameters ξ PI /ξ EI /p denote the shape of the curve more than the magnitude of their exploration preferences. Instead, the values of τ − and τ + primarily explain a subject's exploration vs. exploitation preferences in the augmented formulation. Table 2: Each subject's estimated fit to (4.4) using the augmented acquisition formulation in (4.8). Columns 1-5 show MAP parameter estimates for the acquisition family shape parameter (ξ PI /ξ EI /p), as well as estimates for τ − , τ + , γ, and w. Columns 6-8 show corresponding measures of subject performance in the search task, where ∆r i denotes the average change in score after a subject's ith move.

Fitted Model
Task Performance The estimated values of τ − and τ + in Table 2 show that while many subjects exhibit strong exploratory preferences, there is not a universal pattern exhibited across all of the subjects. Many subjects appear highly exploratory regardless of the change in score on move 1 (subjects 6, 7, 10, 12, 25, and 27) while even more subjects are always exploitative (subjects 1, 4, 8, 9, 11, 13, 20, 23, and 26). Some subjects had significantly stronger exploratory preferences when their first move improved their score (subjects 14 and 16) while others were more likely to explore when their first move yielded a negative change in their score (subjects 18 and 19).
Another behavioral tendency we can assess is the subjects' preferences toward either turning left or right on move 2 (relative to the direction of their first move). Based on the estimates of w shown in Table 2, we find that most subjects do not exhibit a strong bias for one direction over the other, but some subjects do have a strong preference to turn right after move 1 (subjects 6, 10, 21, 28) and one subject has a strong preference to turn left (subject 25).
While the focus of our paper is on making inference on human acquisition strategies when facing an exploration vs exploitation conflict, there are a few relationships between the estimated acquisition functions and subjects' corresponding performance in the task which we will briefly discuss. First, there is a strong relationship between high exploitation preferences and success in the early moves of the task. Letting τ j = ( τ − j + τ + j )/2 and ∆r 2j denote the average change in score after the second move across all of subject j's rounds of the task, the correlation between τ and ∆r 2 is -0.81. This is not surprising-subjects who explore on their second move naturally get a lower average score on this move. However, this exploratory sacrifice is balanced by the ability to better exploit the resulting information on the remaining moves in the round, as reflected in the correlation( τ , ∆r 3 ) = 0.49. Though the relationship is not equally strong, high explorers perform better on average on moves 3 and up.
Ultimately, we care most about the total score improvement on all informed moves: 10 i=2 ∆r i . We do not find a significant relationship between the estimated acquisition function parameters and this measure. However, a strong relationship exists between 10 i=2 ∆r i and 10 i=4 |∆θ i |, where ∆θ i = θ i − θ i−1 . This measures how much "zig-zag" a subject exhibits after having the necessary information to estimate a planar gradient over the click-region. As explained in Appendix A of the supplemental material, given a linear model for the surrogate, the Bayesian optimization algorithm will no longer yield drastic changes in direction after the 3rd move. By contrast, most subjects showed significantly positive values of 10 i=4 |∆θ i |. This could indicate preferences for continued exploration, but it could also result from other unintentional sources of sampling variation (e.g. sloppiness due to the speed with which subjects performed the task). Not surprisingly, the degree of "zig-zag" is related to the aggregate performance measure: correlation( 10 i=4 |∆θ i |, 10 i=2 ∆r i ) = -0.47. Subjects who kept a more constant direction after their 3rd move performed better in the task.

Conclusion
This work introduces a probabilistic framework for inverse Bayesian optimization and shows how it can be implemented through a lab-based sequential optimization experiment. Our methodology involves defining a likelihood on a subject's observed behavior from a search task, where the likelihood is parameterized by a Bayesian optimization subroutine governed by an acquisition function with parameters of its own. This novel structure enables us to make inference on the subject's acquisition function while allowing their behavior to deviate around the solution to the underlying Bayesian optimiza-tion subroutine. This allows us to simultaneously make probabilistic inference about the subject's acquisition function in addition to providing a predictive model of their behavior. The combination of an optimization subroutine within a likelihood-based predictive model distinguishes our approach from other inverse optimization methods.
There are numerous directions that could be explored in future work. In this paper, we base our inference on the 2nd acquisition conditional on the first, but in theory all of a subject's decisions (acquisitions) in a given sample path could be leveraged when making this inference. How to incorporate all of this information into the inference on the acquisition function is a promising area of further study. In addition, our work focuses on a highly customized Bayesian optimization framework. A more general treatment of the inverse Bayesian optimization problem (including GP surrogates, higher dimensionality, a richer class of acquisition functions, etc.) represents a rich and challenging area of future work.
Finally, our study supports findings in human cognition. Specifically, we find that subjects exhibit a wide array of acquisition preferences, but that nearly all of them exhibit exploration tendencies beyond the ability of standard acquisition functions to capture. As in Wu et al. (2018), additional exploratory modifications must be included in the acquisition model in order to accurately represent the observed human behavior. Consistent with Borji and Itti (2013), Wu et al. (2018), and Candelieri et al. (2020, we find that UCB acquisition functions best represent the search strategies of the majority of subjects in our study. Wu, C. M., Schulz, E., Speekenbrink, M., Nelson, J. D., and Meder, B. (2018). "Generalization guides human exploration in vast decision spaces." Nature human behaviour , 2(12): 915-924. 3, 22 Figure 4 of the main text, only here we plot the move 3 acquisition surfaces instead of the move 2 surfaces. In each case, move 2 was selected at an optimum of the previous iteration of the process. The green stars denote the arg max of each surface, while the pink dots show the true direction of the hotspot. In every case, the green stars are nearly identical to the pink dots.
While it is possible to learn the direction of the hotspot after move 2, whether subjects actually do learn the optimal direction is a separate question. Figure 12 shows directional histograms of subject behavior on moves 1, 2, and 3 relative to the direction of the hotspot (aggregated over all subjects). On move 1, the angles are uniform on the circle because subjects have no directional information about the hotspot on this move. The move 2 angles are directionally informed, but subjects must move off the axis of the informed direction in order to gain information about the orthogonal gradient. This can be seen in the move 2 histogram by the modes over ±π/4.   Figure 12: Histograms of subjects' move angles relative to hotspot direction on moves 1, 2, and 3, aggregated over all subjects and rounds.
The move 3 histogram sheds light on how well subjects synthesize the information from moves 1 and 2 in order to learn the direction of the hotspot. As shown by the variance around 0 in the move 3 histogram, subjects do not always synthesize this information perfectly, but the mode angle is toward the direction of the hotspot. On moves 4 through 10 (not shown), the mode remains centered around the direction of the hotspot while the variance around the mode decreases slightly on each subsequent move. Since subject behavior essentially amounts to a noisy walk around the optimal direction after move 2, moves 3 and up provide comparatively little information about the subjects' acquisition preferences. For this reason we do not analyze these moves in our paper.
These parameters were not observed to affect the coverage values of the acquisition function parameters, thus we set them arbitrarily at reasonable values within the range of the parameter estimates for the subjects in our study.

C.2 Inference class contains the truth
After generating the data for each acquisition function/n obs combination, we fit the data to the classes of acquisition functions defined in (3.10)-(3.12) following the procedure outlined in Section 4.1 of the main text. Table 3 shows the acquisition function coverage values based on 95% credible regions of the corresponding posterior distribution over 1000 simulations. Each cell represents the proportion of 95% credible acquisition functions regions that covered the true function that generated the data.

C.3 Inference class omits the truth
We next explored how our method performs when the true generating acquisition function class is omitted from the inference acquisition function class. For this study, we likewise fit the data to the classes of acquisition functions defined in (3.10)-(3.12) following the procedure outlined in Section 4.1, only we omitted the acquisition family that was actually used to generate the data when performing the inference. Table 4 shows the proportion of 95% credible regions (CR) from the resulting inference that contain parameter values from the two acquisition families that were not used to generate the data (over 1000 simulations). Note that the two coverage values on the same row of a given cell do not have to sum to 1.
Acq. ξ/p value n obs = 10 n obs = 100 n obs = 1000  Table 4: Proportion of 95% credible regions (CR) that contain parameter values from the acquisition families that were not used to generate the data (over 1000 simulations). The acquisition family used to generate the data is shown in the left hand column and the number of observations used in the inference is shown in the top row. For each simulation, the acquisition family used to generate the data was not included in the inference class.
When the inference is based on a small amount of data (n obs = 10) the resulting credible regions tend to contain acquistion functions in both of the misspecified classes. As the number of observations increases, the credible regions tend to converge to a single acquisition family, which presumably is closer to the true shape of the omitted generating function. No single family appears to dominate the competing family for every simulation setting. Subject 28 move 2 data Figure 13: Move 2 (x, y) pairs for three subjects in the experiment. On each plot, the large black dots labeled 0 and 1 show the starting location and first move location, respectively (the first move is shown along the x-axis in order to illustrate trends in the move 2 behavior). The smaller colored dots show the locations of the subjects' second moves relative to their first moves. The color gradient denotes ∆r 1 .  Figure 14: (∆r 1 , θ 2 ) pairs for each subject in the study, overlaid with the subject's MAP augmented acquisition curve, 95% HPD credible regions of the acquisition curve (dark-gray), and 95% HPD prediction regions (light-gray). Green denotes PI and red denotes UCB.