Informative Goodness-of-Fit for Multivariate Distributions

This article introduces an informative goodness-of-fit (iGOF) approach to study multivariate distributions. When the null model is rejected, iGOF allows us to identify the underlying sources of mismodeling and naturally equips practitioners with additional insights on the nature of the deviations from the true distribution. The informative character of the procedure is achieved by exploiting smooth tests and random fields theory to facilitate the analysis of multivariate data. Simulation studies show that iGOF enjoys high power for different types of alternatives. The methods presented here directly address the problem of background mismodeling arising in physics and astronomy. It is in these areas that the motivation of this work is rooted.


Introduction
Scientific motivations. When searching for the signals of new particles, or when aiming to detect new astronomical objects, a common difficulty arising in the analysis of the data collected by the detectors is the impossibility of correctly specifying the background distribution. In physics and astronomy, we typically refer with "background" or "noise" to the signal of all the astrophysical sources which are not those we aim to discover.
Moreover, if the model postulated by the scientists is rejected, it is often difficult to identify the invalidating causes. For instance, instrumental errors may lead to unexpected perturbations in the data distribution, or there may be unpredicted cosmic sources with non-negligible contributions. Moreover, given the complexity of the models investigated through physics experiments, it is often convenient to consider simplified versions of them (typically Gaussian approximations, e.g., Balázs et al. (2017)). Hence, it is particularly important to assess the reliability of the simplified models for the data available and, if needed, provide adequate adjustments for them.
Statistical formulation of the problem. In statistical terms, these difficulties translate into two main questions arising in the statistical analysis of multivariate data.
Specifically, given a random vector X = (X 1 , . . . , X p ), we may wonder: Q1. is the distribution of X correctly specified and, if not, in what way does the true data distribution diverge from that hypothesized under the null hypothesis?
Q2. How can we improve our postulated model? Or in other words, can we provide a data-driven correction for it?
As noted by Pearson (1938), smooth tests, originally introduced by Neyman (1937), naturally allow us to capture and model the departure of f from g and thus, they offer the framework to directly address Q1 and Q2.
In order to provide a high level overview on smooth tests, let f be the true (unknown) probability density function (pdf) of a random variable X ∈ R, g is the hypothesized density and G the respective cumulative distribution function (cdf). For example, in the above-mentioned problem of background mismodeling, f represents the true background distribution and g is the background model postulated by the scientists. A smooth model for the true probability law f can be specified as where d(x) = f (x) g(x) is the likelihood ratio and the term in the curly brackets is an orthonormal expansion for it. A smooth test (e.g., Neyman, 1937;Barton, 1953;Ledwina, 1994) consists of testing if any of the coefficients θ j in (1) is different from zero. Finally, by estimating d(x) and constructing adequate confidence bands, it is possible to visualize the nature of the departure of f from g.
Despite their usefulness, smooth tests are mainly limited to the univariate setting. In light of this, the main methodological task of this work is to extend this framework to allow for the analysis of multivariate data.
Main results and organization. The theoretical framework is presented in Section 2. There, we define a suitable expansion of the likelihood ratio through orthonormal functions on the unit cube. As shown in Sections 3 and 4, such representation substantially simplifies the subsequent stages of estimation, model selection and (post-selection) inference. In Section 5, we discuss a simple ANOVA-like testing strategy to identify possible sources of mismodeling. Power studies are conducted via simulations in both Sections 4 and 5. As noted above, this work finds its main motivations in the context of astrophysical searches. Therefore, in Section 6 we illustrate how iGOF can be used to address the problem of mismodeling of the cosmic background considering a realistic simulation from the Fermi Large Area Telescope (Atwood et al., 2009). While this article mainly focuses on the analysis of continuous data, extensions to the discrete setting are discussed in Section 7. Section 8 collects a summary of the results and a discussion of the limitations of iGOF. Technical proofs and codes are provided in the Supplementary Material. A summary of the main notation used throughout the paper is available in the Appendix.

An orthonormal expansion for the likelihood ratio
Suppose F is the true distribution function of a random vector X ∈ X ⊆ R p and denote with G its hypothesized distribution. F and G are assumed to be continuous with densities f and g. Furthermore, assume that f (x) = 0 whenever g(x) = 0. For every x = (x 1 , . . . , x p ) ∈ X , the hypothesized density g is such that where x <d = (x 1 , . . . , x d−1 ) and g 1 , . . . , g p are suitable densities with associated cdfs and quantile functions G d and G −1 d , for all d = 1, . . . , p. The likelihood ratio between F and G can be specified as (Rosenblatt, 1952) In the bivariate setting, for instance, let G 1 ≡ G X 1 and G 2 ≡ G X 2 |X 1 , i.e., the hypothesized marginal cdf of X 1 and the hypothesized conditional cdf of X 2 |X 1 , respectively.
Hence, (2) specifies as . Remark 2.1. As a plausible alternative to Rosenblatt's transform, one could choose each G d ≡ G X d , which corresponds to assuming independence among the components of X. In this setting, if the marginal distributions are correctly specified, (2) is the copula density (e.g., Nelsen, 2007) of X under G. While this choice could simplify substantially the computations, it would not allow us to test models G which assume a specific dependence structure and the interest is in assessing if the joint distribution G is misspecified. Moreover, it is worth pointing out that there are situations where such transformation cannot be specified (e.g., Section 6).
To provide a sufficiently detailed representation of the substructures characterizing the distribution of X (see Q1 in Section 1), a natural approach is that of expressing (2) by means of a suitable orthonormal basis in forms an orthonormal basis on L 2 [0, 1] p , the Hilbert space of square integrable function over the p-dimensional unit cube.
Notice that while any orthonormal basis in [0, 1] could be used to construct a tensor product basis in [0, 1] p , here we focus on the normalized shifted Legendre polynomials.
This choice is justified by the fact that the latter are special cases of the so called LPscore functions (e.g., Mukhopadhyay and Wang, 2020). As discussed in Section 7, the latter allow for extensions to the discrete setting.
Finally, under the assumption that d(u) ∈ L 2 [0, 1] p , we can write The expansion in (4) follows from Theorem II.6 in Reed and Simon (1980) and it is equivalent to say that the sum on the right-hand side As noted by an anonymous referee, the likelihood ratio can also be expanded on the original domain X ⊆ R p by means of any set of bounded functions which are orthogonal with respect to G. In our context, this is achieved by combining the Legendre polynomials and Rosemblatt's transform. The latter provides the additional advantage of allowing us to work on the compact compact domain [0, 1] p . As it will become clear in Section 4, this is particularly useful as one can exploit results from random field theory to construct simultaneous confidence bands. Moreover, for visualization purposes, it may be particularly advantageous to work in the quantile domain when testing long tailed distributions. In this setting, we may expect that only a few observations have been detected over large regions of the X domain and thus the quantile representation allows us to magnify the differences observed over the the most "data-abundant" regions.
An more detailed discussion of this aspect, and adequate graphical comparisons can be found in Algeri (2020). Finally, it is worth pointing out that, the quantile functions G −1 d in G −1 R (u) are used in (2) with the only purpose of highlighting the dependence of d on u, when working in the quantile domain. In practice, however, estimation and inference focus entirely on (4) (see Sections 3 and 4) and thus one needs not to compute G −1 R (u).

Estimation
The summations in (4) are taken up to infinity. However, to make the expansion operational, it is necessary to truncate the series in (4) at integers values m 1 , . . . , m p . That is because, effectively, the coefficients θ j 1 ...jp need to be estimated and, consequently, the more terms are included in (4), the larger the variance of the resulting estimator of d(u) (see Section 4.2 for a more detailed discussion on model selection).
For the sake of simplifying the notation in this section and those to follow, denote with K the set Consider x 1 , . . . , x n , a sample of n i.i.d. observations from X, and let U = G R (X) be the respective Rosenblatt transformation. Denote with u 1 , . . . , u n the sample of elements The parameter θ can be estimated by means of the vector θ of components The mean and covariance matrix of θ and an estimator of d(u) are given in Proposition 3.1.
Proposition 3.1. The likelihood ratio d u is the density of the random vector U and where Σ has diagonal elements σ 2 k n = 1 n V T k (U ) and non-diagonal elements where 0 is the M × 1 zero vector and I M is the M × M identity matrix.
Finally, an estimator of d(u) is and has variance V d(u) = T (u) ΣT (u).
Combining (1), (2) and (9) an estimate of f is Notice that the estimator f incorporates the information carried by the hypothesized model g; whereas, the estimator in the square brackets provides a data-driven correction for it. Furthermore, define the integrated squared bias (ISB) of d(u) to be From Proposition 3.2 it follows that the closer g is to f in terms of squared normalized distance the lower the ISB of d(u).
Proposition 3.2. The integrated squared bias of the estimator in (9) is where 1 is the M × 1 unit vector.
The estimate in (10) is essentially that of a smooth model (e.g., Rayner and Best, 1990), that is, a smoothed version of the true underlying probability function. Similarly  to the smooth model proposed by Barton (1953) in the univariate setting, the estimator in (10) may lead to estimate that are not bona-fide, i.e, they may be negative and/or they may not integrate/sum up to one. In this manuscript we focus on (10) mostly for the sake of mathematical convenience when constructing simultaneous confidence bands in Section 4. Nonetheless, bona-fide estimators can be constructed similarly to the univariate case as described in Algeri and Zhang (2020).
Example I. In direct searches for dark matter, the dominant background sources are neutron recoils which may produce signals mimicking those expected from dark matter candidates (e.g., Westerdale, 2016). As a toy example, suppose we are interested in assessing the validity of a given distribution for the nuclear recoil background specified over the energy region X = [5, 20]KeV nr × [0, 17]KeV nr. Each observations in X corresponds to the scintillation of photons (X 1 ) and ionization electrons (X 2 ) (e.g., Aprile et al., 2017). The hypothesized background distribution, G X 1 X 1 , is that of a truncated bivariate normal with mean vector (12, 8), variances 8 and 12 and covariance 2. Moreover, suppose that one additional background source is present. The latter is also a bivariate normal with the same mean vector, variances 4 and 20 and covariance 5. Thus, the true model, F X 1 X 2 , involves a mixture of two, overlapping truncated bivariate Gaussians with mixture parameter 0.15. In order to estimate the likelihood ratio, set G 1 = G X 1 (x 1 ) and G 2 = G X 2 |X 1 . The estimated likelihood ratio, obtained over a sample of n = 5, 000, is shown in the right panel of Figure 1, whereas the left panel shows the true likelihood ratio. A closed form expression for the estimate shown on the right panel is given in equation (S.34) in the Supplementary Material. The estimate obtained recovers the main departures from uniformity. Specifically, the second mixture component contributing to F X 1 X 2 leads the an inflation of the variance of X 2 ; as a result, F X 1 X 2 exhibits higher tails than G X 1 X 2 in the direction of the first eigenvector. Such departure becomes more and more prominent when moving from the center of the distribution towards the truncation points X 2 = 0 and X 2 = 17. Finally, the contours of d(u) highlight that the estimator is rather noisy. Therefore, it is important to investigate the properties of (9) to assess the significance of the deviations observed.

Pre-selection inference
A smooth test for H 0 : G ≡ F versus H 1 : G ≡ F consists in reformulating the problem as a test for uniformity of U . Specifically, (2) implies that F ≡ G whenever d(u) = 1, and thus It is easy to see that d(u) = 1 for all u ∈ [0, 1] p , when all θ k , k ∈ K, are identically equal to zero. Hence, in practice, we test Notice that H 0 in (13) implies H 0 in (14), but the opposite is not true in general.
Whereas, H 1 in (14) does imply H 1 in (13). With a little abuse of nomenclature, in this section and those to follow, we will refer to G as the "null model". Furthermore, we will refer to H 0 in (13) when generically saying "under H 0 ". However, most of the results presented here, only require validity of the milder H 0 in (14).
To conduct our inference, we consider the so-called deviance test statistics, i.e., Its asymptotic null distribution is given in Theorem 4.1.
where N (0, I) denotes a standard multivariate normal distribution. Furthermore, where M is the size of θ.
The asymptotic distribution of the random field d(u) is derived in Theorem 4.2 below.
This result is particularly useful for us to construct simultaneous confidence bands as described in Section 4.2.
Theorem 4.2. Denote with { d(u)} the random field indexed by u ∈ [0, 1] p with components as in (9). Moreover, assume that where Z(u) denotes a Gaussian random field with mean zero, unit variance and covari- .
At this stage, constructing inference on the basis of Theorems 4.1 and 4.2 would be tempting. However, to guarantee the validity of our results we must take into account that, when estimating the likelihood ratio in (9), a model selection procedure is likely to be implemented. Unfortunately, when a model is selected by a pool of possibilities, such process introduces an additional source of variability and thus the resulting inference is automatically affected (e.g., Berk et al., 2013). Section 4.2 addresses this aspect directly.

Post-selection inference
The estimate of the likelihood ratio considered so far involves up to M functions T k (u).
Nonetheless, it is possible that not all of these M terms are needed to capture the departures of G from F and indeed, it is often convenient to remove some of them to avoid unnecessary sources of noise. Various criteria have been proposed in literature for density estimation and smooth models (e.g., Mukhopadhyay, 2017;Algeri, 2020) and which can be easily extended to the multivariate setting. Here, we focus on the approach of Mukhopadhyay (2017) and which specifies as follows.
Let θ (k) be the k-th largest θ k estimate in order of magnitude, for k ∈ K, i.e., θ 2 (1) ≥ θ 2 (2) ≥ · · · ≥ θ 2 (M ) . Select the K largest coefficients which maximize either Notice that, as defined in (5) is the integer value corresponding to the order of magnitude of the respective coefficient θ k . Hence, the summations in (19) and those to follow are taken over (k) = 1, . . . , K, that is, the K p−tuples of indexes j 1 . . . j p with the K-th largest estimates θ k .
An estimate of d(u), is then selected via (19) from the family of estimators where the subscript (K) on the left-hand-side is used to emphasize that the estimator in (20) includes only the K-th largest θ k estimated coefficients. Clearly, the choice of BIC or AIC is arbitrary and, from a practical standpoint, when n > 7(≈ e 2 ), the BIC assigns a heavier penalty than AIC for increasing values of K and, consequently, it often leads to smoother estimators than AIC.
The selection rules in (19)   (k)=1 θ 2 (k) be the respective deviance statistics. As n → ∞, a valid postselection bound for the p-value to test (13) is where D obs is the value of D (K * ) observed.
Where the bound in (21), follows from the fact that the estimators in (20) are nested, for all K = 1, . . . , M − 1, and thus each In order to grasp further insights on the deviations of G from F , it is worth constructing adequate confidence bands. This can be done, while accounting for post-selection adjustments, as in Corollary 4.4.
Corollary 4.4. Denote with d (K * ) the estimator of d(u) selected via (19), and let SE 0 d (K * ) (u) be its standard error under H 0 . Moreover, assume that, for all K = 1, . . . , M −1. Valid (post-selection adjusted) simultaneous (1−α)% confidence regions, under H 0 in (13), are In (22), " " indicates that the left-hand-side is stochastically lower or equal than the right-hand-side. Intuitively, the validity of (22) in practical settings follows from the fact that d (M ) is the least smooth among all the estimators considered; thus, we expect that the random field resulting from d (M ) has the largest probability of crossing the fixed level The confidence bands in (23) are constructed around 1, not around d (K * ) (u). That is because d (K * ) (u) only accounts for the K * largest terms in (4), therefore, it is a biased estimator of d(u) (see Proposition 3.2). It follows that, when the bias is large, confidence bands constructed around d (K * ) (u) would be shifted away from the true density d(u).
However, under H 0 , both the bias at a point u and the integrated bias are equal to zero.
From a theoretical perspective, a highly non-trivial aspect in the construction of (23) is the estimation of the quantile c α/2 . Probabilities such (24) are known in literature as excursion probabilities (e.g., Adler, 2000) and cannot be expressed in closed form. A possible solution for constructing the confidence bands in (23), is to proceed by estimating SE 0 d (K * ) (u) and c α/2 via Monte Carlo simulations (see Algeri and Zhang, 2020, Algorithm 1). Unfortunately, in the most crucial (astro)physical searches the level of significance required to claim a new discovery is typically in the order of α = 10 −7 (e.g., Lyons, 2013), and thus Monte Carlo simulations may be computationally prohibitive.
This is further aggravated when dealing with complex models for which even a single Monte Carlo replicate can be highly expensive in terms of both computational and time resources.
As a valid alternative, for continuous F and G, accurate approximations for (24) under mild smoothness conditions exist (e.g., Taylor and Worsley, 2008). In our setting, smoothness follows from the fact that the random field in (18)  An approximation for the left-hand side of (24) is for some γ > 1 (Taylor et al., 2005). In (25), L 1 and L 2 are constant known as Lipischitz-Killing curvatures and are typically estimated numerically (e.g., Algeri and van Dyk, 2020). Notice that the error rate in (25) decreases exponentially fast, as α → ∞. Therefore, this solution is particularly amenable to overcome the issues arising when dealing with stringent significance requirements.
As one may expect, the simplicity of the post-selection adjustments in (21) and (23) comes with a price. Specifically, they can be rather conservative for increasing values of  (21) and (23). Darker shades correspond to significant deviations of the estimated likelihood ratio above one. Lighter shades correspond to significant deviations below one. n = 500 n = 1000 n=2000 n = 5000 n = 7000 n = 10, 000  quite accurate and match closely the confidence regions obtained by simulating directly the distribution of (18), while repeating the selection process at each replicate.
Example I (continued). The estimate of the likelihood ratio in the right panel of and estimating L 1 and L 2 by means of the R package TOHM (Algeri, 2019) as described in Algeri and van Dyk (2020). This approach led to c α = 3.5568. The confidence contours suggest that the most prominent deviations occur in correspondence of the Here, the estimator of d(u) shows significant deviations above one and thus we conclude that the postulated model underestimates the truth over these areas. The presence of significant departures of G X 1 X 2 from F X 1 X 2 are confirmed by the deviance test (adjusted p-value ∼ 3.97 · 10 −11 ). The left panel of Figure 2 shows the confidence regions and deviance p-value obtained by means of a Monte Carlo simulation involving B = 10, 000 replicates. The selection procedure has been implemented at each replicate. While more conservative, the confidence regions computed via (23) and (26), approximate reasonably well those obtained via simulation.
Finally, we investigate the probability of type I error and the power of the deviance test based on (21). Table 5 reports the results obtained considering a suite of five simulations, each of size B = 10, 000, conducted using five different sample sizes. For all n considered, the probability of type I error observed is approximately the same than the nominal level α = 0.05. Whereas, the power increases rapidly with n. For the smallest samples sizes considered, i.e., n = 500 and n = 1000, the power is rather low (∼ 22% and ∼ 45%, respectively). However, it has to be noted that, in our example, the mixture parameter is 0.15; therefore the deviations from the postulated model effectively account for only ∼ 75 and ∼ 150 data points when n = 500 and n = 1000, respectively.

iGOF-diagnostic analysis
The constructs introduced so far allow us to assess the validity of the postulated model, obtain an estimate of the likelihood ratio test to visualize where and how departures of g from f occur, and construct a data driven correction for the initial model g (equation 10). Unfortunately, however, a visual inspection is only possible when p ≤ 3. Nevertheless, when p > 3, more insights on the sources of mismodeling affecting G can be obtained by conducting an ANOVA-like analysis where random sub-vectors of X are tested individually, from the largest to the smallest.
Without loss of generality, let X q = (X 1 , . . . , X q ) be the random collecting the first q < p components of X. Denote with F q the true cdf of X q and let G q be its postulated cdf. Moreover, assume that the density of G q can be specified as As in (2) and (4), we can then express the likelihood ratio of X q on the q dimensional unit cube via d(u q ) = j 1 ≥0,...,jq≥0 where u q = G 1 (x 1 ), . . . , G q (x q |x <q ) , and thus u q is a sub-vector of u = G R (x).
Whereas, similarly to (3), one can write the tensor basis functions T j 1 ...jq as The last equality follows from the fact that T 0 G(x d |x <d ) = 1 for all d = 1, . . . , p, and thus each T j 1 ...jq (u q ) = T j 1 ...jq0...0 (u). Consequently, the θ j 1 ...jq coefficients are equal to θ j 1 ...jp whenever j d = 0, for all d = q + 1. As a result, we can easily perform inference for X q by means of the estimators θ k in (6), without the need of an entirely new estimation procedure.
Specifically, denote with K q and K * the subsets of K in (5) of cardinality |K q | = M q = q d=1 (m d + 1) − 1 and |K * | = K * . Recall that K * is the value minimizing either the AIC or BIC in (19), and thus, K * collects all the p−tuple of indexes in K which have been ultimately selected when constructing the estimator d (K * ) and the deviance statistics D (K * ) in Corollaries 4.3 and 4.4. To test we may consider the test statistics D q = n k∈Kq θ 2 k and proceed as in Theorem 4.1.
Whereas, valid post-selection inference can be obtained as in Theorem 5.1.
Theorem 5.1. As n → ∞, a valid post-selection bound for the p-value to test (32) is p-value q,adj = P (χ 2 Mq > D obs ), where D obs being the value of the test statistics observed, K q and K * as in (30) and (31) and θ k as in (6).
Theorem 5.1 follows directly from (28) and (29)  Because of condition (27), Theorem 5.1 holds only for random sub-vectors of X whose Rosenblatt transform u q includes all the conditioning, from the higher to the lower, necessary to recover g q x q . To some extent, this condition can be seen as the iGOF counterpart of the marginality principle advocated by Nelder (1977) in the context of ANOVA, and which consists in taking into account of the hierarchy of the main effects and interactions in a given model.
Similarly to the ANOVA, Theorem 5.1 allows us to construct an iGOF-diagnostic table to identify the source of mismodeling for a given random vector X and its components. Below we show how this can be done in practice for the case of a 7-dimensional random vector.
Example II. We consider a sample of n = 5000 observations from a random vector X = (X 1 , . . . , X 7 ) with components distributed as summarized in the second column of Table 2. Table 3 collects the results obtained by applying Theorem 5.1 to test the validity of the models specified for different sub-vectors of X. The overall deviance test is reported in the first row and correctly reject the null model. Similarly, the test in the second row, rejects the hypotheses that the vector (X 1 , X 2 , X 5 , X 6 ) is modelled correctly, and fails to rejects the model for (X 1 , X 2 , X 5 ).   (33) with X q specified as in the first column. The second column corresponds to the degrees of freedom used in the calculation of the p-value, namely, M q .

Xq
Sample size (n) 500 1000 2000 3000 5000 10, 000  This aspect is particularly important as it highlights that the mismodeling occurs only with respect to the conditional distribution of X 6 |X 1 , X 2 , X 5 . The tests in the fourth and fifth row show that the vector (X 3 , X 4 ) has been mismodeled and one source of mismodeling is the marginal of X 3 . Ultimately, the test for X 7 also correctly rejects the null hypothesis of Cauchy distribution. Table 4 collects the results of a simulation obtained by repeating the diagnostic analysis in Table 3 (21) and (23). Darker shades correspond to significant deviations of the estimated likelihood ratio above one. Lighter shades correspond to significant deviations below one.
whereas, the model for (X 1 , X 2 , X 5 ) is never rejected. More issues arise in diagnosing mismodeling of X 3 and, consequently, (X 3 , X 4 ) for smaller samples. For instance, even when n = 1000 the power of the procedure in detecting departures of G X 3 from F X 3 is only ∼ 56% and ∼ 3% for (X 3 , X 4 ).
It has to be noted, however, that detecting mismodeling of X 3 is a particularly challenging task. As shown in Figure 3, the postulated and the true pdf of X 3 are very close one-another; this minor differences are further "diluted" when considering the joint distribution of (X 3 , X 4 ), since X 4 |X 3 is correctly specified. Nevertheless, such minor deviations are detected with high power for larger sample sizes. The region of interest corresponds to a disc in the sky of 30 • radius and centered at (195 RA,28 DEC), where RA and DEC are the coordinates in the sky. Here we assume that, while the cosmic background is known to follow a uniform distribution over the search area, it is unclear if the instrumental error is effectively negligible, or if it has a prominent effect on the underlying distribution. Therefore, we set G X 1 X 2 to be the cdf of a uniform distribution with support X 1 × X 2 = [165,195]

A diagnosis of background mismodeling
and we proceed by estimating the likelihood ratio via (9) over a sample of n = 68658 observations. Specifically, we set m 1 = m 2 = 4 and we select the components of θ via the BIC criterion in (19). The resulting estimate is In order to assess the significance of the deviations captured by (35), we compute simultaneous confidence regions and deviance p-values via (23) and (21)

Extensions to the discrete case
The methods discussed so far focus on the case where F and G are continuous. However, extensions to the discrete setting can be derived by rewriting the expansion in (4) through an orthonormal set of functions suitable to model discrete data. This can be done, for instance, by means of the so-called "LP 3 score functions", recently introduced (e.g., Mukhopadhyay and Wang, 2020) and which can be seen as a generalization of the Legendre polynomials valid in both the continuous and discrete setting.
Specifically, when p = 1, a complete orthonormal basis of LP score functions in L 2 (G) can be specified by letting the first component to be T 0 G(x) = 1. Subsequent where G mid (x) = G(x) − 0.5p G (x) is the mid-distribution function, which it has been shown in Parzen (2004) to have mean 0.5 and variance [1 − x∈X p 3 G (x)]/12, with X being the set of distinct points in the support of X and p G (x) = P (X = x) if X ∼ G. Therefore, T 1 G(x) is the standardized mid-distribution and orthonormality of the T j G(x)] functions in L 2 (G) follows by the first equality in (36) and by Gram-Schmidt process.
Notice that, for continuous X, G mid (x) = G(x) and x∈X p 3 G (x) = 0, consequently, the LP score functions reduce to normalized shifted Legendre polynomials. The latter are effectively the result of a Gram-Schmidt orthonormalization applied to powers of G(x). Whereas, the LP score functions are obtained by orthonormalizing powers of the standardized mid-distribution function with respect to the measure G.
Recall that, in our context, the cdfs G d , d = 1, . . . , p are the conditional and marginal distribution functions specified in the Rosenblatt's transform G R (x). Hence, an or- and || · || G d = √ < ·, · > G d .
When p > 1 a suitable tensor basis in L 2 (G) can then be constructed as in (3).

Discussion
This work proposes an informative approach to goodness-of-fit which connects exploratory and confirmatory data analysis to study multivariate distributions. By transforming the likelihood ratio on the unit cube, confidence regions can be constructed as in Corollary 4.4 to identify regions of the supportwhere significant deviations occurs. While this approach is practical only for problems in at most three dimensions, in more dimensions a detailed diagnosis of mismodeling can be achieved by means of the iGOF-diagnostic analysis proposed in Section 5. These tools can be used to directly address Q1 in Section 1. For instance, given the panacea of theories available on the nature of dark matter, experimentalists aiming to detect it often face the dilemma of selecting which of the tens of theoretical models (mainly non-nested) available should be tested (e.g., Scott, 2018).
If one was to test it using the procedure discussed in this paper, even when a given model is rejected, it is possible to gain further insight on the shape of the departure of the true data distribution and the null model and ultimately use such information to "rule out" other models which would be inconsistent with such deviation.
Moreover, as we aimed for when formulating Q2 in Section 1, the true probability function of the data can be estimated semi-parametrically via (10), while assessing the validity of the model postulated by the scientists. Interestingly, the resulting estimate incorporates the knowledge carried by the hypothesized model and thus, it provides a data-driven update for it in the direction of the true distribution of the data.
Despite the usefulness of the methods presented here in applied settings, and in the physical sciences in particular (e.g., Section 6), they are not exempt from limitations.
For instance, several problems in physics and astronomy, often involve no more than 8 or 10 dimensions and/or can be reduced to 2D planes (e.g., Aprile et al., 2017). In this context, choosing m d equal to 3 or 4 for all d = 1, . . . , p, is often sufficient to avoid overfitting and, eventually, lack of power by implementing adequate model selection strategies and for sufficiently large samples (see Sections 4 and 5). In more dimensions, however, the method suffers from the curse of dimensionality (e.g. Friedman et al., 2001), as the size of the LP tensor basis increases exponentially fast with p. In this context, a regularized solution could be particularly valuable (see for instance Signoretto et al., 2014) when analyzing, for instance, data coming from large astronomical surveys such as the Large Synoptic Survey Telescope (LSST) survey (e.g., Tyson, 2002). Alternatively, if the interest is merely in detecting signals without assuming a specific background model, a data-driven solution for high-dimensional data has been recently proposed by (??).
Furthermore, the unitary representation of the likelihood ratio in (2) relies on the Rosenblatt transform and which can lead to different configurations of U and, potentially, different estimators. While this aspect would require adequate treatment on its own, it is worth noting that this problem is essentially the same arising in the context of vine copulas (e.g., Nagler et al., 2017) and for which adequate model selection procedures exists (e.g., Panagiotelis et al., 2017;Dissmann et al., 2013).
Finally, the inferential procedures presented here extend classical smooth tests to the multivariate setting and allow us to visualize graphically the departure of F from G and study their substructures. Despite this article focuses on simple null hypothesis, that is, the postulated model is assumed to be fully specified, classical results on smooth tests (e.g., Thas, 2010, Sec 4.2.2.3 and 5.2.2.3 ) can be used to show to derive asymptotic tests in the parametric setting. Unfortunately, however, the asymptotic approximations are known to be rather slow in the parametric case. Therefore, in practical applications, when G depends on unknown parameters it is recommended to perform inference by means of the parametric bootstrap and which has been shown by Babu and Rao (2004) to be consistent also in the multivariate setting.
Symbol Description X = (X1, . . . , Xp) Random vector of components X d , d = 1, . . . , p X Support of X F, f True cdf and density of X G, g Postulated cdf and density of X M × 1 vector of components T k (u) = T k GR(x) θ M × 1 vector collecting the coefficients θ k θ M × 1 vector collecting the estimates θ k D Deviance statistics Table 5: A summary of the main notation used throughout the paper.