A Symmetric Prior for Multinomial Probit Models

Under standard prior distributions, fitted probabilities from Bayesian multinomial probit models can depend strongly on the choice of a base category, which is used to identify the model. This paper proposes a novel identification strategy and prior distribution for the model parameters that makes the prior symmetric with respect to relabeling the outcome categories. Further, our new prior permits an efficient Gibbs algorithm that samples rank-deficient covariance matrices without resorting to Metropolis-Hastings updates.


Introduction
In multinomial probit (MNP) models of discrete choices, parameters are typically identified by selecting a base category relative to which the choice parameters are defined. From the point of view of identification, the choice of base category is immaterial. However, in a Bayesian framework, previously developed priors can be sensitive to the base category specification -sometimes strongly so. Hence, when practitioners choose a base category in the MNP analysis, they are unwittingly making a decision about the prior specification for their model.
In this paper, we propose sum-to-zero restrictions on the latent utilities and regression parameters that define the MNP model. In this novel identification framework, we are able to define a prior which is symmetric with respect to relabeling of the outcome categories.
Even so, this model preserves the favorable computational aspects of other, recent Bayesian MNP models (Imai and van Dyk, 2005a;Burgette and Nordheim, 2012;Jiao and van Dyk, 2015).

Multinomial probit models of discrete choice
Multinomial probit (MNP) models are popular in studies involving discrete choice data (McFadden, 1974;Train, 2003). They have applications in marketing (Rossi et al., 2005), politics (Rudolph, 2003), transportation studies (McFadden, 1974;Garrido and Mahmassani, 2000), and beyond. The MNP is more flexible than standard multinomial logit models, as it need not make an assumption of independence of irrelevant alternatives (IIA). This means that the ratio of selection probabilities for two outcome categories can depend on the characteristics of another category. Further contributing to the popularity of the MNP is a series of advances in Bayesian computation, starting with Albert and Chib (1993), that has made it increasingly computationally manageable (McCulloch and Rossi, 1994;McCulloch et al., 2000;Imai and van Dyk, 2005a,b).
The MNP requires two normalizations in order to identify the model. These models can be derived through the assumption that agents construct latent Gaussian utilities and select the category that corresponds to the largest utility. Since the ordering of the utilities is maintained by an additive shift or multiplicative rescaling, identifying assumptions on the scale and location are needed.
In order to set the scale, it has been standard to fix an element on the main diagonal of the covariance matrix at one. Burgette and Nordheim (2012) demonstrated that the choice of which element one fixed could have a meaningful impact on posterior predictions, when using the popular prior of Imai and van Dyk (2005a). To avoid this problem, they proposed a model that identifies the scale of the model by fixing the trace of the covariance matrix, which makes the prior covariance invariant to joint permutations of the rows and columns. This paper will build upon such a trace-restricted prior, resolving the location identification issue as well.
Previous MNP models have set the location of the latent utilities by specifying a base (or reference) category for the model. The base category's utility is then subtracted from all of the other utilities for each observation, removing the indeterminacy of the location. But, Burgette and Nordheim (2012) noted that Bayesian MNP predictions can be sensitive to the specification of the base category, though they did not provide a satisfactory solution for this issue. This problem arises because instead of specifying a prior for the original utilities and inducing a prior on the base-subtracted utilities, it has been standard to specify a prior directly on base-subtracted utilities.
Rather than selecting a reference category whose utility is assumed to be equal to zero, we enforce a sum-to-zero restriction on the latent utilities. If respondents choose from p categories, other MNP methods transform the utilities to (p − 1)-space. Instead, we constrain our utilities to exist in a (p − 1)-dimensional hyperplane in p-space.
We apply our new prior to two consumer choice datasets, as well as a series of simulated datasets based on the consumer choice studies. In doing so, we see that the symmetric MNP (sMNP) model defines a more sensible model, produces better predictions, and has favorable computational properties compared to previous MNP models.

Preliminaries
Assume that agent i = 1, . . . , n is choosing among p mutually exclusive alternatives. The MNP can be derived by assuming that there exist vectors of latent Gaussian utilities W i = {w ij } of length p, and that each agent selects the alternative with the highest utility, so that we observe Y i = arg max j w ij .
It is standard to assume that the utilities take the form (1) X i is a matrix of covariates, β is a vector of regression parameters, and ε i iid ∼ normal(0, Σ) capture variations in taste across agents. We will assume X i contains intercept terms, k d covariates that vary by decision-maker (e.g., a buyer's age), and k a alternative-specific covariates (e.g., product prices). We assume the covariates are arranged in that order (from left to right) so that The k d -vector x i,d is the collection of covariates that vary by individual; x i,a is a p × k a matrix whose columns contain the values of the variables that vary by alternative.
The standard identifying approach is to transform with J p−1 a column vector of ones with length p−1. Then we would assume and ε * i iid ∼ normal(0, Σ * = T bc ΣT bc ). Albert and Chib (1993) had the key insight that data augmentation (Tanner and Wong, 1987) would greatly ease the estimation of the MNP. If we treat the latent W * i as parame-ters to be updated in the MCMC algorithm, then under a normal prior, the full conditional distribution of β * is normal. Further, the full conditional distribution of each W * i is truncated multivariate normal, which can be updated one component at a time as univariate truncated normals (McCulloch and Rossi, 1994).
It then remains to sample Σ * , the (p−1)-dimensional covariance over the base-subtracted utilities. Up to a constraint and the normalizing constant, the priors for both the Imai and van Dyk and the Burgette and Nordheim models are the same: where 1{cond} is equal to one if {cond} is a true statement, and zero otherwise. For Imai and van Dyk, this condition is {σ * 11 = 1}; for Burgette and Nordheim the condition is {tr(Σ * ) = (p − 1)}. Further, the working parameter is given the joint prior after which Σ * can be sampled in a Gibbs step through a draw ofΣ = α 2 Σ * .

Asymmetries of currently-used MNP priors
Later in this paper, we will demonstrate empirically that switching from one base category to another can result in substantial differences in estimated (posterior) purchase probabilities in marketing applications that appear elsewhere in the literature. To motivate this work, we begin by highlighting differences in the prior purchase probabilities under two base category specifications, conditional on a range of values of the structural portion of the utilities, X * i β * . In what follows, we consider a simple case with p = 3 categories and focus on one of the three outcome categories, which we will refer to as the category of interest.
First, we specify this category of interest as the base category (corresponding to Y i = 0), and then we reparametrize so that the category of interest is the first nonbase category (corresponding to Y i = 1). Our experience indicates that sensitivity to the base category primarily comes from the prior on Σ * (rather than β * ), so we will condition on β * in order to clarify the issue.  Figure 1: The left-hand panel displays density histograms of the probability for the "outcome of interest" when it is coded as the base category (solid) and not the base category (dashed) for a particular value of β. More precisely, it is the density histograms of ϕ j (1; Σ * )p(Σ * ) for j = 0, 1, where j = 0 corresponds to the solid curve, and j = 1 corresponds to the dashed curve. (See (7) -(9).) The right-hand panel plots ψ j (v) across a range of v. That is to say, the average values of the distributions in the left-hand panel correspond to the values in the right-hand panel at v = 1.
To simplify notation, we define In (9), p(Σ * ) refers to the trace-restricted variant of the Imai and van Dyk prior for Σ * with ν = 2 degrees of freedom, and centered at S = .5J 2 J 2 + .5I 2 . These hyperparameters correspond to a default prior for a model of p = 3 outcome categories. (Using S = I 2 would result in stronger asymmetries with respect to the base category.) Figure 1 compares ϕ j and ψ j for j = 0, 1. From the left-hand panel, note that there are strong differences in the range of probabilities for the outcome of interest that are supported by the prior for Σ * , after conditioning on β * . In particular, the distribution probabilities for the base category (solid curve) has a very sharp mode, and is less diffuse in general relative to distribution for the nonbase category (broken curve). On the other hand, the curves in the right-hand figure nearly coincide with one another. This indicates that differences between the two parametrizations in the prior are obscured by marginalizing over the distribution of Σ * . However, we note that these curves often times do not coincide after conditioning on observed data, as will be shown in Figures 2 and 6.
Because the differences in probabilities appear primarily to be of second and higher moments, an ad-hoc solution to the problem of base category dependence (such as spec- ifying alternative values of the hyperparameters, or by specifying a different p(β * |Σ * ) to compensate) may be difficult. Although we expect the impact of the prior to fade as the sample size increases, information in multinomial models accrues slowly relative to standard models of a continuous outcome, which means that asymmetries in the prior for an MNP model may persist in the posterior for sample sizes that are typical in business and economics applications. Hence, we pursue a prior that is identically invariant to relabeling the outcome categories.

A symmetric prior for MNP regressions
We now propose a symmetric MNP (sMNP) model that is invariant under relabeling or reordering of the outcome categories. Rather than identifying the locations of the latent utilities by subtracting one from the others, we instead require that they sum to zero. (This assumes that the choice-specific covariates have mean zero for each observation, which is a convenient but inessential standardization.) Further, we assume that the regression parameters that correspond to each agent-specific covariate sum to zero, which gives the same degrees of freedom as the standard MNP, where (in a sense) the regression parameters related to the base category are set equal to zero.
With this sum-to-zero restriction on the utilities, we require a covariance for W i that is symmetric and positive-semidefinite with p−1 positive eigenvalues, and constrained in some way in order to set the scale of the model. Rather than directly specifying a distribution on p × p matrices, we build it up with a mixture of trace-restricted positive-definite matrices.
Conditionally, we assume that a positive-definite matrix of dimension p − 1 describes the covariance of all but one of the dimensions of W i . We denote the left-out category with the parameter b, and refer to it as the faux base category indicator. In contrast to previous MNP models, b is learned according to Bayes rule.
The proposed model is as follows: In this formulation, p TR is the trace-restricted variant of the Imai and van Dyk (2005a) prior in (5). Its hyperparameters S b and ν b may change with b but we recommend using common hyperparameters in most cases, since S b = diag{(1 + c, . . . , 1 + c)} − cJJ for all b and a common ν b will yield a prior covariance structure that is symmetric with respect to the outcome categories. As a default, we recommend using c = 1/(p − 1). This corresponds to the first p − 1 rows and columns of a symmetric p × p covariance matrix P with p − 1 positive eigenvectors that is symmetric with respect to relabeling the rows and columns.
This matrix has the property that vectors drawn from the normal(0, P ) distribution sum to zero almost surely, which is a natural center for our relabeling invariant, sum-to-zero MNP.
Using c = 0 means roughly that we expect p − 1 of the dimensions of the utilities to be independent, with the remaining dimension strongly anti-correlated. Using c = 1/(p − 1) is a more neutral prior, and seems to lead to better mixing in the MCMC.
. The function f acts on β b such that for each sub-vector of length p − 1 that corresponds to an agent-specific covariate (or the intercepts), β is equal to β b with an extra dimension inserted at the bth position in the sub-vector. This inserted element is chosen so that the sub-vector sums to zero.
With this model specification, we induce a prior distribution on the set of positivesemidefinite matrices of dimension p that have exactly p − 1 positive eigenvalues. It would also be possible to work with a matrix decomposition like Σ = ADA , where A is a p×(p−1) orthogonal matrix and D is diagonal. One could then define a prior on the Stiefel manifold that contains A (Hoff, 2009). This would be a more direct definition on positive semidefinite matrices, but inducing a prior in the manner implied by our model is conceptually simple and guarantees favorable computational properties.
To make the motivation of this new set of identifying restrictions explicit, we note that they result from transforming the unnormalized utilities not by T bc as in (3), but rather multiplying them by a p-dimensional square matrix T s that is defined to have ones on the main diagonal, and entries of −1/(p − 1) elsewhere. Note that arg max W i = arg max T s W i , while the elements of T s W i sum to zero. This transformation also induces the proposed identifying restrictions on β. If we partition β = (β d , β a ), where β a corresponds to the covariates that vary by outcome category, we have This transformed version of β (i.e., the second factor on the right-hand side of the above equation) conforms to the proposed identifying restrictions. Similarly, a normal distribution with mean zero and covariance T s ΣT s results in draws that sum to zero almost surely. (Note that T s is almost idempotent in the sense that T s T s = cT s for some scalar c. The first p − 1 rows and columns of T s therefore serve as our default for S b since this corresponds to the transformed variance of ε i if its variance in the unnormalized scale is proportional to the identity.) We emphasize that there is nothing inherently wrong with using the asymmetric identifying transformation T bc . If we do not wish for our inferences to depend on the base category, however, the prior must compensate for the asymmetries in the transformation. This seems quite difficult to achieve, especially if we hope to have a computationally tractable model.
Using T s , however, we can decouple prior specification and model identification, all while preserving the favorable computational characteristics of existing MNP models.

Model estimation
We propose a Gibbs sampler to estimate the model by constructing a Markov chain on a transformed space: . By explicitly working in the (α, Σ b , b,W,β b ) parametrization in specifying the Gibbs sampler we avoid the mistakes discussed in Jiao and van Dyk (2015), although our algorithm is different than theirs.
Remark. Note that at every iteration in the Markov chain, Σ b is restricted to satisfy the identifying trace restriction.
Let X i,b indicates X i with the bth row and the columns specific to the bth category removed. We initialize the latent utilities W i by sampling a standard normally-distributed vector of length p and centering it at zero. We then permute its elements so that the maximum of each W i coincides with the observed Y i .
The sampler proceeds in three steps: Remark. Note that all variables referenced in these Gibbs steps are from the same parameterization: (α, Σ b , b,W,β b ).
Next, we consider each step in turn.
In the original parametrization we have By definition,W = αW and I(W, Y ) is the indicator that the utilities and data match correctly, so with α known at this stage of the Gibbs sampler, we write the conditional distribution as: To sampleW , we iterate one-by-one through the elements ofW i,b . Note thatw i,b is known given b andW i,b . After dropping the bth element ofW i and the corresponding elements in X i and β, the full conditionals of elements ofW i,b are truncated univariate normal. The conditional means and variances can be calculated as described by McCulloch and Rossi (1994), usingβ b as the coefficient vector and α 2 Σ b as the covariance. The truncations are:

This update follows from the fact that if
Remark. Having obtained samples from the Markov chain defined over (α, Σ b , b,W,β b ), one can transform, per-iteration, back to the original space to obtain samples of W =W/α and β b =β b /α. To interpret the β parameters, we know -by the sum-to-zero property of the intercept terms -that a brand with an intercept coefficient that is persistently negative (All) is less desirable than average, in a sense (Figure 4). EraPlus and Tide are estimated to be more desirable. However, note that these intercepts do not reflect marginal purchase probabilities, as less desirable brands may also have lower prices. As economic theory would suggest, the price coefficient is strongly negative (Figure 5), which indicates that raising a detergent's price (relative to the competitors) will lower its estimated purchase probability.

Clothes detergent purchases
Although these interpretations of the β parameters are accurate, we would argue that summaries of MNP results are best phrased in terms of changes in posterior predicted selection probabilities. For example, one might consider the effect of a proposed price increase on the current purchase probabilities. We advocate this because predictions take into account both β and Σ parameters, and the Σ parameters can be very difficult to interpret on their own. If only the β parameters are of interest in an application, we would argue that a model that assumes IIA may be more appropriate.

Margarine purchases
We also consider a similar analysis of consumer purchases of margarine that are available in the bayesm package in R. Again, our model only has intercepts and a price coefficient.
Following McCulloch and Rossi (1994), we limit our analysis to purchases of Parkay, Blue Bonnet, Fleischmanns, House brand, Generic, and Shedd Spread tub margarines.
And, following Burgette and Nordheim (2012), we limit the analysis to the first purchase of one of these brands for each household. This results in a dataset with 507 observations.
With the smaller sample size, there are larger differences in posterior estimated purchase probabilities when one switches from one base category to another in standard MNP fits. In Figure 6, we see that sMNP predictions again tend to be between those of standard MNP models when we consider all possible base categories, as was the case in Figure 2.
The observed House brand prices are between $0.19 and $0.64, so there is significant disagreement across nearly the entire range of observed prices for that brand. (With the larger sample size in the detergent data, we only saw meaningful differences when we extrapolated out of the observed price range.) Although there is some Monte Carlo error in the estimates, it is insignificant compared to the 19% difference between the low and high estimates of House's selection probability when it is priced at $0.20.
Thus, in both of these examples, we see that the sMNP gives predictions that are between those of the standard MNP models that are fit alternately with each base. This is compatible with the heuristic interpretation of the sMNP as a model that averages across base categories in standard MNP models.
An alternative approach to handling dependence on the base category would be to fit an Imai and van Dyk-style MNP model using each base category separately, and perform a post-hoc average of the fitted probabilities. We find this to be unappealing from several perspectives. First, the computation load is p times as large as it would be for a single, standard MNP fit; the sMNP is only slightly more expensive than a single base category MNP.
More importantly, the sMNP constitutes a proper Bayesian procedure, which automatically incorporates posterior uncertainty about the base category and uses a likelihood-weighted average of the possible models (bottom panel of Figure 3).

A simulation study
Here we compare the fitted probabilities of MNP models that use each of the possible base categories and the fitted probabilities that result from the base category-free sMNP. We simulate 50 datasets that are loosely based on the consumer choice examples above. We assume that n = 750 consumers are choosing from p = 6 products. The simulated productspecific intercepts and mean prices have correlation 0.9 so that more desirable products are more expensive, as one would expect. The price coefficient was drawn uniformly from [−1.25, −.75] so that if a product is relatively less expensive, it will be more popular. Finally, a p × p covariance matrix with expectation I is drawn from an inverse-Wishart distribution with 50 degrees of freedom. The simulation parameters were chosen so that each "brand" is chosen with high probability. Note that the data parameters were chosen without regard to any set of identifying restrictions.
We measure performance via the total variation between the estimated and true purchase probabilities, averaged over the first 10 sets of prices in each simulated dataset. We expect that the sMNP will be less prone to making "extreme" predictions in the sense of Figure   6. The results are summarized in Figure 7, and are consistent with this notion. The plot gives the average total variation from the true purchase probabilities for each of the base category MNP models (hollow circles) and the sMNP (solid circles). Note that the sMNP is never the worst among the various base category models. In 18 of the 50 simulated cases, sMNP outperformed all of the base category models. In 41 out of 50 of the simulations, the sMNP performed better than the median base category performance.  Figure 7: Simulation results. Points give the average percent total variation between true and estimated purchase probabilities. Solid black circles are from the sMNP. Hollow gray circles are from MNP models that use each of the six possible base category identifying restrictions. The sMNP is almost never worse than every base category model, only 1 out of 50 cases, and in 41 out of 50 cases it beats the median performing base category.

Repetition
A potential downside to our model is that it is not formally identified. In particular, the model would be identified if we were able to restrict the trace of Σ, rather than the trace of Σ b . If one of the diagonal elements of Σ is estimated to be substantially larger than 1, then the scale of β will depend on b. Although a fully identified model may be preferable, we argue that little is lost in this case.
First -from the perspective of prior specification/elicitiation -the model is identified conditional on the discrete parameter b. If the analyst wishes to specify an informative prior, this can be done conditionally for each b = 1, . . . , p. If the model were only identified conditional on a continuous working parameter, this process becomes more difficult. Second -on the side of interpretation -we would argue that β parameters should be interpreted while taking Σ into account, and vice versa. Since marginal summaries do not do this, we feel that the best model summaries are changes of fitted probabilities as a function of key outcome variables such as in Figure 6, which are not impacted by this identification issue. If the analyst truly is interested in features of the marginal posterior distribution of β or Σ, it is possible to post-process the results into a single, identified scale by re-scaling the sampled values at each iteration of the MCMC such that, for example, the trace of Σ is equal to p. However, the signs of the estimated β parameters are not impacted by the under-identification of our model.
Post-processing in order to identify Bayesian MNP models was popularized by McCulloch et al. (2000), in the context of specifying a prior forΣ * , rather than the identified Σ * .
As an aside, we note that a related idea for solving the base category problem would be to specify a full-rank inverse-Wishart prior for Σ, without worrying about the conditional identifying restriction on the location of the W i . However, this approach proves to be numerically unusable. The p-dimensional inverse-Wishart prior pushes the sampled values of Σ toward the edge of the parameter space, which quickly results in numerical problems that result from sampling poorly-conditioned covariance matrices.

Conclusion
The analyses in this paper demonstrate that careful handling of the prior is necessary in order to obtain reliable predictions from the Bayesian MNP. As with any proper Bayesian model, our estimates are biased, but they are not biased against any particular outcome category in the prior. The same can not be said of previous MNP models that estimate the covariance of the utilities.
With the prior for the regression coefficients centered on zero, the sMNP estimates should be pulled toward more moderate estimates. Since multinomial data are quite coarse (in the sense that each observation contributes little information compared to a multivariate normal regression where the utilities are observed) we would argue that this prior-induced regularization toward moderate predictions is highly desirable.
When building more advanced MNP models, symmetry may take on even greater importance. For example, Cripps et al. (2010) proposed an MNP model that allows for a sparse representation of the precision matrix of the latent utilities. However, they induce sparsity in the precision of the base-subtracted utilities, not in the precision of the original utilities. This seems very likely to exacerbate the problem of posterior estimates changing across different specifications of the base category. Further, it is unclear that sparsity in the base-subtracted precision corresponds to a meaningful data-generating process. That said, it is likely that favorable bias/variance tradeoffs can be made by specifying a prior that pulls the precision toward a well-chosen, sparse structure.
More broadly, the regularizing effect of a Bayesian prior distribution is at its most powerful when the likelihood is poorly behaved in some way: when it is flat or spiky; when identification is weak; when the number of parameters is large relative to the sample size.
However, in each of these situations, we should be worried that if our prior has undesirable features, they may be preserved in the posterior. For example, MNP likelihoods can be quite flat, and therefore the asymmetry of previously-proposed priors can propagate to the posterior. Data analysts may hope that such undesirable features of the prior would be overwhelmed by the likelihood. This research suggests that while we cannot always count on the data to cover flaws of our priors, we may be able to design priors that lack the flaw in the first place, without giving up computational tractability.

Appendix: Proof
Conditional distribution of (b,Σ b ) Computationally, the major change from the algorithm of Burgette and Nordheim (2012) is the draw from (b,Σ b ). From the full conditional, we have from the inverse-Wishart density. Conditional on b,Σ b can be sampled from an inverse-Wishart distribution as described by Imai and van Dyk (2005a).