Pretest estimation in combining probability and non-probability samples

Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified framework for the test-and-pool approach to general parameter estimation by combining gold-standard probability and non-probability samples. We focus on the case where the study variable is observed in both datasets and each dataset also contains auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to determine the comparability of the non-probability data with the probability data and decide whether or not to leverage the non-probability data in a pooled analysis. When the two sources are comparable, our approach combines both for efficient estimation; otherwise, we retain only the probability data for estimation. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters targeting the smallest mean squared error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval that has a good finite-sample coverage property.


Introduction
It has been widely accepted that probability sampling, where the selected sample can be treated as representative of the target population, is the best vehicle for finite-population inference.
Since the sampling mechanism is known from the survey design, weight-calibrated samples can be used to obtain consistent estimators for the target population; see Särndal et al. (2003), Cochran (2007) and Fuller (2009) for textbook discussions. However, complex and ambitious surveys have recently faced mounting hurdles, such as costly intervention strategies and declining participation rates. Baker et al. (2013) address some of the current challenges in using probability samples for finite-population inference. Meanwhile, growing demand for small area estimation, among other factors, has led researchers to seek alternative data collection with smaller program budgets (Williams and Brick, 2018, 2019). In particular, much attention has been drawn to non-probability samples.
Non-probability samples are sets of selected units whose sampling mechanism is unknown. First, non-probability samples are readily available from many data sources, such as satellite information (McRoberts et al., 2010), mobile sensor data (Palmer et al., 2013), and web survey panels (Tourangeau et al., 2013). In addition, these non-representative samples are far more cost-effective than probability samples and have the potential to provide estimates in near real-time, unlike traditional inferences derived from probability samples. Based on such big and easily accessible data, a wealth of literature has emerged describing the promise of properly utilizing data of this scale (e.g., Citro, 2014; Tam and Clarke, 2015; Pfeffermann et al., 2015).
However, the naive use of such data cannot ensure the statistical validity of the resulting estimators because non-probability samples are often selected without sophisticated supervision. Therefore, the acquisition of large yet highly unrepresentative data is likely to produce erroneous conclusions. Couper (2000) and Elliott et al. (2017) present recent examples where non-probability samples lead to estimates with significant selection biases. To overcome these challenges, it is essential to establish appropriate statistical tools to draw valid inferences when integrating data from the probability and non-probability samples. Various data integration methods have been proposed in the literature to leverage the unique strengths of the two types of samples; see Yang and Kim (2020) for a review. The existing methods can be categorized into three types: inverse propensity score adjustment (Rosenbaum and Rubin, 1983; Elliott), calibration weighting (Deville and Särndal; Kott), and mass imputation (Rivers; Kim and Wang, 2019; Yang et al., 2021).
However, most of these works assume that the non-probability sample is comparable to the probability sample for estimating the finite-population parameters, which may not hold in many applications because the sampling mechanism of the non-probability sample is unknown.
Thus, non-probability samples with unknown sampling mechanisms may bias the estimators of the target parameters. To resolve this issue, Robbins et al. (2021) propose a pretest to gauge the statistical adequacy of integrating the probability and non-probability samples in an application. The pretesting procedure has been broadly practiced in econometrics and medicine, and its implications are of considerable interest (e.g., Wallace, 1977; Toyoda and Wallace, 1979; Baltagi et al., 2003; Yang et al., 2022). Essentially, the final value of the estimator depends on the outcome of a random testing event and is therefore a stochastic mixture of two different estimators. Despite the long history of pretesting in practice, little literature investigates the theoretical properties of the underlying non-smooth distribution of pretest estimators.
In this paper, we establish a general statistical framework for the test-and-pool analysis of probability and non-probability samples by constructing a test to gauge the comparability of the non-probability data and decide whether or not to use them in a pooled analysis. We consider the null, fixed alternative, and local alternative hypotheses for the pretest, representing different levels of comparability of the non-probability data with the probability data. In particular, the non-probability sample is perfectly comparable under the null hypothesis, whereas it is starkly incomparable under the fixed alternative. The fixed alternative therefore cannot adequately capture the finite-sample behavior of the pretest estimator, since under it the test statistic diverges to infinity as the sample size increases. To this end, we establish the asymptotic distribution of the proposed estimator under local alternatives, which better approximates the finite-sample behavior of the pretest estimator when the idealistic assumption required for the non-probability data is weakly violated. We also provide a data-adaptive procedure to select the values of the tuning parameters that achieve the smallest mean squared error of the pretest estimator. Lastly, we construct a robust confidence interval accounting for the non-regularity of the estimator, which has a valid coverage property.
The rest of the paper is organized as follows. Section 2 lays out the basic setup and presents an efficient estimator for combining the non-probability sample and the probability sample. Section 3 proposes a test statistic and the test-and-pool estimator. In Section 4, we present the asymptotic properties of the test-and-pool estimator, an adaptive inference procedure, and a data-adaptive selection scheme for the tuning parameters. Section 5 presents a simulation study to evaluate the performance of our test-and-pool estimator. Section 6 provides a real-data illustration. All proofs are given in the Appendix.
Basic setup

Notation: two data sources

Let F_N = {(X_i, Y_i) : i = 1, . . . , N} denote a finite population of size N, where X_i is a vector of covariates and Y_i is the study variable. We assume that F_N is a random sample from a superpopulation model ζ, and our objective is to estimate the finite-population parameter µ_g ∈ R^l, defined as the solution to the population estimating equation (1), where S(V_i; µ) is an l-dimensional estimating function. The class of parameters is fairly general. For example, when S(V_i; µ) = X_i(Y_i − X_i^T µ), µ_g is the coefficient of the finite-population regression projection of Y_i onto X_i.
Suppose that there are two data sources, one from a probability sample, referred to as Sample A, and the other from a non-probability sample, referred to as Sample B. Assume Sample A is independent of Sample B, and that the observed units can be envisioned as being generated through two phases of sampling (Chen et al., 2019). First, a superpopulation model ζ generates the finite population F_N. Then, the probability (or non-probability) sample is selected from it using a known (or unknown) sampling scheme. Hence, the total variance of estimators considered here reflects the randomness induced by both the superpopulation model and the sampling mechanisms; see Table 1 for the notation for probability order, expectation and (co-)variance. For example, E_p(· | F_N) is the average over all possible samples under the probability design for a particular finite population F_N, and E(·) is the average over all possible samples from all possible finite populations.
Our focus is on the setting where the covariates X and the study variable Y are available in both the probability and non-probability samples, which has also been considered in Elliott and Haviland (2007) and Elliott (2009). The sampling indicators are denoted by δ_{A,i} and δ_{B,i}, respectively; e.g., δ_{A,i} = 1 if unit i is selected into Sample A and zero otherwise. Sample A contains observations O_A = {(d_i = π_{A,i}^{-1}, X_i, Y_i) : i ∈ A} with sample size n_A, where π_{A,i} is the known first-order inclusion probability for Sample A, and Sample B contains observations O_B = {(X_i, Y_i) : i ∈ B} with sample size n_B. The unknown propensity score for being selected into Sample B is denoted by π_{B,i}. Here, A and B denote the index sets of units in Samples A and B with total sample size n = n_A + n_B and negligible sampling fractions, i.e., n/N = o(1). Let the limiting fractions of Samples A and B be f_A = lim_{n→∞} n_A/n and f_B = lim_{n→∞} n_B/n with 0 < f_A, f_B < 1.
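As a concrete illustration of the design-weighted estimation that Sample A supports, the sketch below computes a Hájek-type weighted mean from a Bernoulli probability sample. The data-generating model, constants, and variable names are illustrative only and are not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite population; the model and constants are invented for illustration
N = 100_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Probability Sample A: Bernoulli (Poisson) sampling with known first-order
# inclusion probabilities pi_{A,i}
pi_A = np.where(x > 0, 0.006, 0.002)
delta_A = rng.random(N) < pi_A

# Design-weighted (Hajek-type) estimate of the population mean, i.e. the
# weighted sample analog of the estimating equation with S(V; mu) = Y - mu
d = 1.0 / pi_A[delta_A]                       # design weights d_i = 1 / pi_{A,i}
mu_A_hat = np.sum(d * y[delta_A]) / np.sum(d)
```

Because the inclusion probabilities depend on x, the unweighted sample mean would be biased, while the design-weighted estimate targets the finite-population mean.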

Assumptions and separate estimators
As observing (X_i, Y_i) for all units i in U is usually not feasible in practice, we can estimate the population estimating equation (1) by its design-weighted sample analog under the probability design, yielding a design-weighted Z-estimator µ_A (van der Vaart, 2000). When S(V; µ) is a score function, the resulting estimator is a pseudo maximum likelihood estimator (Skinner et al.). We now make the following assumption for the design-weighted Z-estimator.

Assumption 1. (Design consistency and central limit theorem) Let µ_A be the design-weighted Z-estimator of µ_g, which is design-consistent with design variance of order O(n_A^{-1}) and satisfies a central limit theorem under the sampling design.

Under typical regularity conditions (Fuller, 2009), Assumption 1 holds for many common sampling designs such as probability proportional to size and stratified simple random sampling. Under Assumption 1, µ_A is design-consistent and does not rely on any modeling assumptions. This explains why probability sampling has been the gold-standard approach for finite-population inference, and we maintain this assumption throughout the article.

Let f(Y | X) be the conditional density function of Y given X in the superpopulation model ζ, and let f(X) and f(X | δ_B = 1) be the density functions of X in the finite population and the non-probability sample, respectively. To correct for the selection bias of the non-probability sample, most of the existing literature considers the following assumptions (e.g., Rivers; Vavreck and Rivers, 2008; Chen et al., 2019).
Assumption 2. (Common support and ignorability of sampling) (i) The vector of covariates X has a compact and convex support, with its density bounded and bounded away from zero. Also, there exist positive constants C l and C u such that C l ≤ f (X)/f (X | δ B = 1) ≤ C u almost surely.
(ii) Conditional on X, the density of Y in the non-probability sample follows the superpopulation model, i.e., f(Y | X, δ_B = 1) = f(Y | X). (iii) The sample inclusion indicators δ_{B,i} and δ_{B,j} are independent given X_i and X_j for i ≠ j.
Assumptions 2 (i) and (ii) constitute the strong sampling ignorability condition (Rosenbaum and Rubin, 1983). Assumption 2 (i) implies that the support of X in the non-probability sample is the same as that in the finite population; it can also be formulated as a positivity assumption that pr(δ_B = 1 | X) > 0 for all X. This assumption does not hold if certain units would never be included in the non-probability sample. Assumption 2 (ii) is equivalent to the ignorability of the sampling mechanism for the non-probability sample conditional on the covariates X (Little, 1982). This assumption holds if the set of covariates contains all the outcome predictors that affect the possibility of being selected into the non-probability sample. Assumption 2 (iii) is a critical condition for employing the weak law of large numbers under the non-probability sampling design (Chen et al., 2019). Under Assumption 2, the non-probability sample can be used to produce consistent estimators. However, this assumption may be unrealistic if the non-probability data collection suffers from uncontrolled selection biases (Bethlehem, 2016), measurement errors, or other error-prone issues. Thus, we regard Assumption 2 as an idealistic assumption, which may be violated and therefore requires pretesting.
Under Assumptions 1 and 2, let Φ_A(V, δ_A; µ) and Φ_B(V, δ_A, δ_B; µ) be two l-dimensional estimating functions for the target parameter µ_g when using the probability sample and the combined samples, respectively. In practice, Φ_A(·) and Φ_B(·) may depend on unknown nuisance functions, and solving E{Φ_A(V, δ_A; µ)} = 0 and E{Φ_B(V, δ_A, δ_B; µ)} = 0 is not feasible. By replacing the nuisance functions with their estimated counterparts, and the expectations with empirical averages, we obtain µ_A and µ_B as the solutions of the resulting sample estimating equations.

Remark 1. For estimating the finite population mean, that is, µ_g = Ȳ_N, Φ_A(·) is commonly chosen as the design-weighted estimating function δ_A π_A^{-1}(Y − µ), and Φ_B(·) as a doubly robust combination such as δ_B π_B(X)^{-1}{Y − m(X)} + δ_A π_A^{-1}{m(X) − µ}, where π_B(X) = pr(δ_B = 1 | X) and m(X) = E(Y | X, δ_B = 1). To obtain the estimators µ_A and µ_B, parametric models π_B(X; α) and m(X; β) can be posited for the nuisance functions π_B(X) and m(X), respectively.
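To make Remark 1 concrete, here is a minimal numerical sketch of the two estimators for a population mean, with a doubly robust form for Φ_B(·). The data-generating model and variable names are invented for illustration, and the nuisance functions are plugged in at their true values rather than estimated:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

# Sample A: probability sample with known inclusion probabilities pi_A
pi_A = np.full(N, 0.01)
in_A = rng.random(N) < pi_A

# Sample B: non-probability sample; its selection mechanism depends on x
# and would be unknown in practice
pi_B_true = 1.0 / (1.0 + np.exp(-(-2.0 + 0.5 * x)))
in_B = rng.random(N) < pi_B_true

# Nuisance functions; here the truth is plugged in for brevity, whereas in
# practice pi_B(X; alpha) and m(X; beta) would be fitted models
pi_B_hat = pi_B_true
m_hat = 1.0 + 2.0 * x                  # outcome regression m(X) = E(Y | X, delta_B = 1)

# Sample-A-only estimator: solves sum_A d_i (Y_i - mu) = 0
d_A = 1.0 / pi_A[in_A]
mu_A_hat = np.sum(d_A * y[in_A]) / np.sum(d_A)

# Doubly robust pooled estimator:
# sum_B {Y_i - m(X_i)} / pi_B(X_i) + sum_A d_i {m(X_i) - mu} = 0
bias_corr = np.sum((y[in_B] - m_hat[in_B]) / pi_B_hat[in_B])
mu_B_hat = (bias_corr + np.sum(d_A * m_hat[in_A])) / np.sum(d_A)
```

The bias-correction term uses the non-probability sample and is approximately mean zero when the outcome model is correct, so the pooled estimate stays centered at the population mean.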
In addition, researchers might be interested in estimating individual-level outcomes rather than population-level outcomes. In this case, Φ_A(·) and Φ_B(·) can be specified for estimating the outcome model m(X; β). Next, we adopt the model-design-based framework for inference, which incorporates the randomness over the two phases of sampling (Molina et al., 2001; Binder and Roberts; Xu et al., 2013). The asymptotic properties of µ_A and µ_B can be derived using standard M-estimation theory under suitable moment conditions.

Lemma 1. Suppose Assumptions 1 and 2 and the additional regularity conditions S3 hold. Then, n^{1/2}(µ_A − µ_g) and n^{1/2}(µ_B − µ_g) are asymptotically jointly normal, where the asymptotic variances V_A and V_B and covariance Γ are defined explicitly in the Appendix.

In Lemma 1, we extend the conditional normality to unconditional normality as in Schenker and Welsh (1988), which implies that the asymptotic (co-)variance terms V_A, V_B and Γ reflect all sources of uncertainty over the two phases.

Efficient estimator
Under Assumptions 1 and 2, both µ_A and µ_B are consistent, and it is appealing to combine µ_A with µ_B for efficient estimation. We consider a class of linear combinations of the estimating functions in (3), where Λ is the linear coefficient that gauges how much information from the non-probability sample should be integrated with the probability sample. Equation (7) leads to a class of composite estimators, each a weighted average of µ_A and µ_B with Λ-indexed weights ω_A and ω_B. When Λ = 0, (7) recovers the design-consistent estimator µ_A. The optimal choice Λ_eff can be empirically tuned to minimize the asymptotic variance of the composite estimator, leading to the efficient estimator µ_eff. However, the major concern for µ_eff is the possible bias due to the violation of Assumption 2 (ii) for the non-probability sample. When it is violated, it is reasonable to choose Λ = 0 and prevent any bias associated with the non-probability sample.
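For intuition, a scalar version of such a composite estimator can be written down directly. The weight formula below is the textbook variance-minimizing choice for combining two correlated estimates; it is only a stand-in for the paper's Λ_eff, and all numbers are hypothetical:

```python
import numpy as np

def composite(mu_A_hat, mu_B_hat, v_A, v_B, cov_AB=0.0):
    """Variance-minimizing convex combination of two correlated scalar estimates.

    The weight on mu_B_hat that minimizes the variance of
    w * mu_B_hat + (1 - w) * mu_A_hat is (v_A - cov_AB) / (v_A + v_B - 2 cov_AB).
    Clipping to [0, 1] means w = 0 recovers the probability-only estimate.
    """
    w_B = (v_A - cov_AB) / (v_A + v_B - 2.0 * cov_AB)
    w_B = float(np.clip(w_B, 0.0, 1.0))
    return (1.0 - w_B) * mu_A_hat + w_B * mu_B_hat, w_B

# mu_B_hat has a quarter of the variance of mu_A_hat, so it gets weight 0.8
mu_eff_hat, w_B = composite(mu_A_hat=1.10, mu_B_hat=0.98, v_A=0.04, v_B=0.01)
```

The combined estimate moves most of the way toward the lower-variance input, which is exactly the efficiency gain pooling is meant to deliver when both inputs are unbiased.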

Test-and-pool estimator
Motivated by the above reasoning, we develop a strategy that pretests the comparability of the non-probability sample with the probability sample first and then decides whether or not we should combine them for efficient estimation. We formulate the hypothesis test in Section 3.1, and construct the test-and-pool estimator in Section 3.2.

Hypothesis and test
We formalize the null hypothesis H_0, under which Assumption 2 holds, and the fixed and local alternatives H_a and H_{a,n}, under which Assumption 2 is violated. Specifically, H_0 postulates that the estimating function Φ_B(·) is unbiased, H_a that its bias equals a fixed η_fix, and H_{a,n} that its bias lies in an n_B^{-1/2} neighborhood of zero indexed by η, where µ_{g,0} = E_ζ(µ_g), µ_g = µ_{g,0} + O_ζ(N^{-1/2}), and η_fix, η are two fixed parameters. The fixed alternative H_a is commonly considered in the standard hypothesis testing framework. However, it forces the bias of the estimating function Φ_B(·) to be fixed and indicates a strong violation of Assumption 2 (ii), under which the test statistic T diverges to infinity with the sample size.
Moreover, inference under the fixed alternative cannot capture the finite-sample behavior of the test well and lacks uniform validity. In contrast, the local alternative provides a useful tool for studying the finite-sample distribution of non-regular estimators when the signal of violation is weak, i.e., in the n_B^{-1/2} neighborhood of zero. In such cases, we allow the existence of a set of unmeasured covariates whose association with either the possibility of being selected into Sample B or the outcome is small. The local alternative H_{a,n} is also more general in the sense that it reduces to H_a with η = ±∞, and it has been widely employed to study non-regular settings, such as weak instrumental variables regression (Staiger and Stock, 1997), regression estimators of weakly identified parameters (Cheng, 2008) and test errors in classification (Laber and Murphy, 2011). We will mainly exploit the local alternative to show the inherent non-regularity of the pretest estimator.
Under the null hypothesis (8), µ B is consistent, and hence, it is reasonable to combine µ A and µ B for efficient estimation. However, when the null hypothesis is violated as in (10), the efficient estimator is biased. Lemma 2 presents the asymptotic properties of the separate and efficient estimators under H a,n .
Lemma 2. Suppose Assumptions 1, 2 (i) and (iii), and all the regularity conditions in Lemma 1 hold. Then, under the local alternative H_{a,n}, µ_A remains asymptotically normal and centered at µ_g, whereas the asymptotic distributions of µ_B and of the efficient estimator µ_eff acquire biases driven by the local parameter η; the exact limiting distributions and V_eff are presented in Lemma S3.
By Lemma 2, among the three estimators µ A , µ B and µ eff , when H 0 holds, µ eff is optimal because it is consistent and the most efficient; while when H 0 is violated, µ A is optimal because it is consistent but the other two estimators are not.
We now use pretesting to guide the choice of estimator. To test H_0, the key insight is that µ_A is always consistent for µ_g by Assumption 1, so if H_0 holds, the scaled estimating function Φ_{B,n}(µ_A, τ) should behave as a mean-zero random vector asymptotically. Thus, we construct the test statistic T as the quadratic form T = Φ_{B,n}(µ_A, τ)^T Σ̂_T^{-1} Φ_{B,n}(µ_A, τ), where Σ_T is the asymptotic variance of Φ_{B,n}(µ_A, τ), and Σ̂_T is a consistent estimator of Σ_T.
The exact form of Σ_T in (S15) involves V_A, V_B, and Γ. Thus, Σ̂_T can be obtained by replacing the unknown components in the expression of Σ_T with their estimated counterparts, and the expectations with empirical averages. In addition, we can consider the replication-based method for variance estimation in Algorithm B.1, adapted from Mashreghi et al.
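A minimal sketch of such a Wald-type quadratic-form statistic follows; the estimating-function value and its estimated variance are hypothetical inputs that would come from the data in practice:

```python
import numpy as np
from scipy.stats import chi2

def pretest_statistic(phi_B_n, Sigma_T_hat):
    """Wald-type quadratic form T = Phi' Sigma^{-1} Phi for the pretest.

    phi_B_n     : scaled estimating function evaluated at the design-consistent
                  estimate (length-l vector)
    Sigma_T_hat : consistent estimate of its asymptotic variance (l x l matrix)
    """
    phi = np.asarray(phi_B_n, dtype=float)
    return float(phi @ np.linalg.solve(np.asarray(Sigma_T_hat, float), phi))

# Under H0, T is approximately chi-squared with l degrees of freedom
l = 2
T = pretest_statistic([0.3, -0.1], np.eye(l) * 0.25)
c_gamma = chi2.ppf(0.95, df=l)        # (1 - gamma) critical value of chi2_l
reject = T > c_gamma
```

Here T = 0.4 is far below the 95% critical value (about 5.99 for two degrees of freedom), so the null of comparability would not be rejected.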
Lemma 3 serves as the foundation for our data-driven pooling step in Section 3.2.

Data-driven pooling
If T is large, H_0 is likely violated and it is desirable to retain only the probability sample for estimation. If T is small, there is no evidence against H_0, which suggests combining the probability and non-probability samples for efficient estimation. This strategy leads to the test-and-pool estimator µ_tap, defined as the solution to the estimating equation that uses Φ_A(·) alone when T > c_γ and the pooled combination in (7) when T ≤ c_γ, where c_γ is the (1 − γ) critical value of χ²_l. In (13), we can fix Λ at the optimal form Λ_eff, leading to an efficient estimator under H_0 as in Section 2.3. Alternatively, we view c_γ and Λ jointly as tuning parameters that determine how much information from the non-probability sample is borrowed in pooling. Larger c_γ and Λ borrow more information from the non-probability sample, leading to more efficient but more error-prone estimators, and vice versa. We will use a data-adaptive rule to select (Λ, c_γ) to minimize the mean squared error of µ_tap.
Remark 2. Compared with the t-test-based pooling estimator in Mosteller (1948), our proposed method is more general in that (a) the auxiliary covariates are used to provide a more informative model of µ_g; (b) our test statistic T is motivated by the estimating function, which can be more robust to model misspecification; and (c) a data-adaptive selection of (Λ, c_γ) is adopted to minimize the post-integration mean squared error.

Asymptotic properties of the test-and-pool estimator
In this section, we characterize the asymptotic properties of µ_tap. Before proceeding, we introduce more notation. Let I_{l×l} be the l×l identity matrix, F_l(·; η) be the cumulative distribution function of the χ²_l distribution with noncentrality parameter η, and F_l(·) = F_l(·; 0). Denote V_{A-eff} = V_A − V_eff and V_{B-eff} = V_B − V_eff, which are both positive-definite.

Asymptotic distribution
By construction, µ_tap is a pretest estimator that first computes T to test H_0 and then forms test-based weights for combining µ_A and µ_B. Deriving the asymptotic distribution of µ_tap is challenging because it involves the test statistic T and two asymptotically dependent components µ_A and µ_B. To formally characterize the asymptotic distribution of µ_tap, we proceed in three steps. First, we decompose the asymptotic representation of µ_tap into two orthogonal components, one affected by the testing and the other not.
Second, by Lemma 3, T can asymptotically be written as a quadratic form in a standardized l-variate normal vector W_2, and another standardized l-variate normal vector W_1 is introduced for the orthogonal component.
Third, µ_tap can be asymptotically represented by two components involving W_1 and W_2, respectively; one component is affected by the test constraint and the other is not.
Following the above steps, Theorem 1 characterizes the asymptotic distribution of µ tap .
Theorem 1. Suppose the assumptions in Lemma 2 hold, except that Assumption 2 (ii) may be violated as dictated by H_{a,n} in (10). Let W_1 and W_2 be independent normal random vectors with means µ_1 and µ_2 (given below, which vary by hypothesis) and identity variance matrices I_{l×l}. The test-and-pool estimator µ_tap then follows a mixture asymptotic distribution determined by W_1, W_2 and the threshold c_γ.

Theorem 1 reveals that the asymptotic distribution of µ_tap depends on the local parameter η and thus characterizes the non-regularity of the pretest estimator. When H_0 is violated weakly (a small perturbation of the true data-generating model), the asymptotic distribution of µ_tap can change abruptly depending on η. The non-regularity of µ_tap also poses challenges for inference, as shown in Section 4.3. Based on Theorem 1, we derive the asymptotic biases and mean squared errors of µ_tap under H_0 and H_{a,n}, which serve as the stepping stone to a data-driven procedure for selecting the tuning parameters Λ and c_γ.

Asymptotic bias and mean squared error
By Theorem 1, the asymptotic distribution of µ_tap involves elliptically truncated normal distributions (Tallis, 1963; Barr and Sherrill). To understand the asymptotic behavior of the proposed estimator, it is crucial to understand the essential properties of elliptically truncated multivariate normal distributions. We derive the moment generating function and subsequently the mean squared error of the estimator µ_tap. The exact form of the mean squared error, mse(Λ, c_γ; η) in (S35), albeit complicated, reveals that the amount of information borrowed from the non-probability sample (controlled by Λ and c_γ) should be tailored to the strength of the violation of H_0 (dictated by the local parameter η). For illustration, we consider a toy example in the supplemental material.
We search for the optimal values (Λ*, c*_γ) that minimize mse(Λ, c_γ; η) using a standard numerical optimization algorithm (Nelder and Mead, 1965), where η is estimated by Φ_{B,n}(µ_A, τ). Note that the decision of whether to reject H_0 is subject to the hypothesis testing errors, namely the Type I and Type II errors. That is, the test statistic T can be larger than c_γ even when H_0 holds; similarly, it can be small when H_{a,n} holds. However, the data-adaptive tuning procedure aims at minimizing the mean squared error of the estimator µ_tap, which implicitly restricts these two testing errors to be small.
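The tuning step amounts to a small smooth optimization, e.g. via `scipy.optimize.minimize` with the Nelder-Mead method. The objective below is a hypothetical surrogate with a known minimum, standing in for the analytic mse(Λ, c_γ; η) from the paper:

```python
from scipy.optimize import minimize

def mse_surrogate(params):
    """Hypothetical stand-in for mse(Lambda, c_gamma; eta): a smooth bowl with
    a unique minimum at Lambda = 0.6, c_gamma = 3.84, used only to illustrate
    the derivative-free tuning step."""
    lam, c_gamma = params
    return (lam - 0.6) ** 2 + 0.1 * (c_gamma - 3.84) ** 2

res = minimize(mse_surrogate, x0=[0.5, 3.0], method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6})
lam_opt, c_opt = res.x
```

Nelder-Mead is a natural choice here because the true mse surface, while available analytically, need not have convenient derivatives in (Λ, c_γ).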

Adaptive inference
Standard approaches to inference, e.g., the nonparametric bootstrap, require the estimators to be regular (Shao, 1994). In non-regular settings, researchers have proposed alternative approaches such as the m-out-of-n bootstrap or subsampling. However, these approaches critically rely on a proper choice of m or the subsample size; otherwise, the small-sample performance can be poor.
The non-regularity arises because the asymptotic distribution of the estimator µ_tap depends on the local parameter and thus does not converge uniformly over the parameter space. Laber and Murphy (2011) propose adaptive confidence intervals for test errors in classification problems.
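For reference, the m-out-of-n bootstrap mentioned above resamples m < n observations per replicate; a generic sketch, with illustrative names and the sample mean as the statistic:

```python
import numpy as np

rng = np.random.default_rng(2)

def m_out_of_n_bootstrap(data, stat, m, n_boot=2000, rng=rng):
    """m-out-of-n bootstrap: each replicate resamples only m < n observations
    with replacement. This can remain valid for non-regular estimators where
    the standard bootstrap fails, but its performance hinges on the choice of m.
    """
    n = len(data)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)
        reps[b] = stat(data[idx])
    return reps

data = rng.normal(loc=1.0, scale=2.0, size=1000)
reps = m_out_of_n_bootstrap(data, np.mean, m=200)
lo, hi = np.quantile(reps, [0.025, 0.975])
```

With m = 200 out of n = 1000, the replicate distribution is wider than the standard bootstrap's, which illustrates why a poor choice of m translates directly into poorly calibrated intervals.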
Following this idea, we construct the bound-based adaptive confidence interval (BACI) for the estimator µ tap that guarantees good coverage properties. To avoid the non-regularity, our general strategy is to derive two smooth functionals that bound the estimator µ tap . Because these two functionals are regular, standard approaches to inference can be adopted and valid confidence intervals follow.
To be concrete, we construct a bound-based adaptive confidence interval for a^T µ_g, where a ∈ R^l is fixed. By Theorem 1, we can reparametrize the asymptotic distribution of a^T n^{1/2}(µ_tap − µ_g) as the sum of two terms R_n and U_n, with the truncated parameter µ^t_{[c_γ,∞)} = µ_2 1(µ_2^T µ_2 > c_γ). By construction, R_n is regular and asymptotically normal, but U_n is nonsmooth. Nonsmoothness and nonregularity are interrelated. To illustrate, if µ_2 = 0, U_n follows a standard truncated normal distribution with truncation probability pr(W_2^T W_2 ≤ c_γ | µ_2 = 0); whereas if |µ_2| → ∞, pr(W_2^T W_2 ≤ c_γ | µ_2) diminishes to zero, implying that U_n follows a standard normal distribution. Thus, the limiting distribution of a^T n^{1/2}(µ_tap − µ_g) is not uniform over the local parameter µ_2 (or equivalently η).
Our goal is to form the least conservative smooth upper and lower bounds. An important observation is that if |µ_2| is sufficiently large, we may treat U_n as regular. Thus, we define B as the nonregular zone for µ_2^T µ_2 such that max_{µ_2^T µ_2 ∈ B} pr(W_2^T W_2 ≥ c_γ | µ_2) ≤ 1 − ε for small ε > 0, and B^c as the regular zone. When µ_2^T µ_2 ∈ B^c, standard inference applies; bounds are needed only when µ_2^T µ_2 ∈ B, to avoid an overly conservative inference procedure. We then require another test procedure to test whether µ_2^T µ_2 ∈ B. Figure 1 illustrates the regular and nonregular zones and the test. If T ≥ v_n, we conclude that the estimator µ_tap is regular and construct a normal confidence interval; if T < v_n, we construct the least favorable confidence interval by taking the union over all µ_2 ∈ R^l. In practice, v_n can be determined by the double bootstrap subject to the regularity condition that lim_{n→∞} v_n/n = 0; see Section B.4 of the supplemental material for more details.
Accordingly, U_n can be decomposed into two components according to the events T ≥ v_n and T < v_n, and we only regularize (i.e., derive bounds for) the latter component. Continuing with (14), we can take the supremum over all µ_2 in the nonregular zone to construct the upper bound U(a) in (15). The lower bound L(a) for a^T n^{1/2}(µ_tap − µ_g) can be computed analogously by replacing sup with inf in (15). Taking the supremum and the infimum over µ_2 ∈ R^l renders the two bounds U(a) and L(a) smooth and regular. The limiting distribution of U(a) is given in (16); similarly, the limiting distribution of L(a) is obtained from (16) by replacing sup with inf. Based on the limiting distributions of U(a) and L(a), if pr(µ_2^T µ_2 ∈ B) = 0, U(a) and L(a) have approximately the same limiting distribution as a^T n^{1/2}(µ_tap − µ_g). However, if pr(µ_2^T µ_2 ∈ B) ≠ 0, U(a) is stochastically larger and L(a) is stochastically smaller than a^T n^{1/2}(µ_tap − µ_g).

Figure 1: Illustration of the nonregular zone B (shaded) and two power functions: the solid and dashed lines are pr(W_2^T W_2 > c_γ | µ_2^T µ_2) and pr(T ≥ v_n | µ_2^T µ_2) as functions of µ_2^T µ_2, respectively.
Based on the regular bounds U(a) and L(a), we construct the (1 − α) × 100% bound-based adaptive confidence interval of a^T µ_g in (17), where U_d(a) and L_d(a) approximate the d-th quantiles of the distributions of U(a) and L(a), respectively, which can be obtained by the nonparametric bootstrap.
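Schematically, once bootstrap draws of the two regular bounds are available, the interval can be assembled from their quantiles. The function below is a plausible percentile-style version under assumed names, not the authors' exact formula (17):

```python
import numpy as np

def bound_based_ci(point, upper_draws, lower_draws, n, alpha=0.05):
    """Sketch of a bound-based interval: bootstrap draws of the regular upper
    bound U(a) and lower bound L(a) for a' sqrt(n) (mu_tap - mu_g) are turned
    into an interval for a' mu_g by inverting the scaled pivot.
    (Illustrative construction, not the paper's exact formula.)
    """
    u_hi = np.quantile(upper_draws, 1.0 - alpha / 2.0)
    l_lo = np.quantile(lower_draws, alpha / 2.0)
    return point - u_hi / np.sqrt(n), point - l_lo / np.sqrt(n)

rng = np.random.default_rng(3)
upper = rng.normal(0.3, 1.0, size=2000)    # draws of the stochastically larger bound
lower = rng.normal(-0.3, 1.0, size=2000)   # draws of the stochastically smaller bound
ci_lo, ci_hi = bound_based_ci(point=1.0, upper_draws=upper,
                              lower_draws=lower, n=400)
```

Because the upper bound is stochastically larger and the lower bound stochastically smaller than the true pivot, the resulting interval errs on the conservative side, matching the coverage inequality in Theorem 2.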
Theorem 2. Assume the conditions in Theorem 1 hold. Furthermore, assume the matrices Σ_T, Σ_S in Lemma 2 and their consistent estimates Σ̂_T, Σ̂_S are strictly positive-definite, and the sequence v_n satisfies v_n → ∞ and v_n/n → 0 with probability one. The asymptotic coverage rate of (17) satisfies the lower bound in (18). In particular, if Assumption 2 is strongly violated with pr(µ_2^T µ_2 ∈ B^c) = 1, the inequality in (18) becomes an equality.
Remark 3. We discuss an alternative approach to constructing valid confidence intervals for non-regular estimators using projection sets (Robins, 2004), referred to as the projection-based adaptive confidence interval (PACI), C^PACI_{µg,1−α}(a). The basic idea is as follows. For a given µ_2, the limiting distribution of µ_tap is known, and a regular (1 − α_1) × 100% confidence interval C_{µg,1−α_1}(a; µ_2) of a^T µ_g can be formed through the standard procedure. Since µ_2 is unknown, a (1 − α) × 100% projection confidence interval of µ_g can be conservatively constructed as the union of C_{µg,1−α_1}(a; µ_2) over all µ_2 in its (1 − α_2) × 100% confidence region, where α = α_1 + α_2. Such a strategy may be overly conservative, so the projection-based adaptive confidence interval introduces a pretest to mitigate the conservatism. If the pretest rejects H_0: µ_2^T µ_2 ∈ B, C_{µg,1−α_1}(a; µ_2) is used; otherwise, the union of C_{µg,1−α_1}(a; µ_2) is used. The technical details for C^PACI_{µg,1−α}(a) are presented in the supplemental material. Our simulation study later shows that C^PACI_{µg,1−α}(a) is more conservative than the proposed C^BACI_{µg,1−α}(a).

Simulation study
In this section, we evaluate the finite-sample performance of the proposed estimator µ_tap and of C^BACI_{µg,1−α}(a). First, we generate the finite population F_N with size N = 10^5, where each subject i has covariates X_i, study variable Y_i, and an unobserved variable u_i. We then generate samples from F_N by Bernoulli sampling with specified inclusion probabilities, where ν_A and ν_B are adaptively chosen to ensure the target sample sizes n_A ≈ 600 and n_B ≈ 5000. We assume that (X_i, Y_i) are observed but u_i is unobserved, and we vary b in {0, 10, 100} to represent the scenarios where H_0 holds, is slightly violated, or is strongly violated, respectively.
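The step of choosing ν_A (and analogously ν_B) so that Bernoulli sampling hits a target expected sample size can be done by a simple bisection; the logistic form and all constants below are illustrative, not the paper's exact simulation model:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
x = rng.normal(size=N)

def calibrate_nu(lin_pred, target_n):
    """Choose the intercept nu so that Bernoulli inclusion probabilities
    expit(nu + lin_pred) have an expected sample size of target_n, via
    bisection on nu (the expected size is monotone increasing in nu)."""
    lo, hi = -20.0, 20.0
    for _ in range(100):
        nu = (lo + hi) / 2.0
        if (1.0 / (1.0 + np.exp(-(nu + lin_pred)))).sum() > target_n:
            hi = nu
        else:
            lo = nu
    return nu

nu_A = calibrate_nu(0.1 * x, target_n=600)
pi_A = 1.0 / (1.0 + np.exp(-(nu_A + 0.1 * x)))
sample_A = rng.random(N) < pi_A   # realized size fluctuates around 600
```

The realized sample size is binomial around the target, which is why the paper writes n_A ≈ 600 rather than fixing it exactly.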
We compare the estimator µ tap with other estimators: (5), where (α, β) are estimated by using the maximum pseudo-likelihood estimator α and the ordinary least square estimator β (Haziza and Rao;; see Equations (S24) and (S26). (d) µ eff : the solution to (7) with the optimal choice Λ eff specified in (S11) and the consistent estimators ( α, β) obtained from (c). (e) µ eff:B : µ eff , where α is estimated in the same manner as (c) but β is estimated solely based on the non-probability sample; see Equation (S25).
For all estimators, we specify the model π B (X; α) to be a logistic regression model with X i and the outcome mean model m(X; β) to be a linear regression model with X i . For the non-regular estimators µ tap , µ tap:B and µ eff:KH , we construct the C BACI µg,1−α (a) in (17) with a data-adaptive choice of ν n , the C BACI µg,1−α (a) with a fixed v n = log log n {C BACI:F µg,1−α (a)} (BACI F ), and the C PACI µg,1−α (a). For any confidence interval requiring the nonparametric bootstrap, the bootstrap size is 2000.
For the Bayesian estimators, the point estimates are obtained by Markov chain Monte Carlo sampling with 2000 draws after an additional 500 burn-in samples. Table 2 reports the bias, variance and mean squared error of each estimator over 2000 simulated datasets. The benchmark estimator µ A has small biases across all scenarios, guaranteed by the probability sampling design. On the other hand, the non-probability-only estimator µ B exhibits large biases in all cases, mainly due to selection bias. As the impact of the unmeasured confounder b increases, the pooled estimators µ eff , µ eff:B and µ eff:KH become more biased. The Bayesian methods, particularly µ Bayes:2 , perform reasonably well when H 0 holds or is slightly violated, but they tend to have large biases when H 0 is strongly violated. In contrast, the proposed estimators µ tap , µ tap:B and µ tap:KH have small biases regardless of the strength of the unmeasured confounder. When H 0 is slightly violated, our proposed estimators have slightly larger biases but smaller mean squared errors than µ A by integrating the non-probability sample. When H 0 is strongly violated, the proposed estimators perform similarly to µ A owing to the protection of pretesting. Table 3 reports the properties of 95% Wald confidence intervals for the regular estimators, the highest posterior density intervals (HPDIs) for the Bayesian estimators, and various adaptive confidence intervals for the non-regular estimators µ tap , where the Bayesian credible intervals are constructed from the posterior samples after burn-in. Because the confidence intervals (and the point estimates; see Table 2) are not sensitive to the method of estimating the nuisance parameters (α, β), we only present the confidence intervals for µ eff:KH and µ tap:KH for simplicity.
Based on Table 3, C PACI µg,1−α tends to overestimate the uncertainty, leading to overly conservative confidence intervals. C BACI µg,1−α and C BACI:F µg,1−α are less conservative and alleviate the over-coverage issue; their empirical coverage rates are close to the nominal level in all cases. Moreover, C BACI µg,1−α yields narrower intervals than C BACI:F µg,1−α by using the double bootstrap procedure to select v n , at the expense of computational burden. When H 0 holds, C BACI µg,1−α is narrower than the Wald interval for the probability-only estimator µ A , indicating the advantage of the test-and-pool strategy in these cases. When H 0 is slightly violated, the intervals have similar coverage rates and no significant benefit in width is observed.
When H 0 is strongly violated, the adaptive confidence interval C BACI µg,1−α reduces to the Wald confidence interval for µ A . Lastly, the credible intervals for the Bayesian estimators do not have satisfactory coverage properties because the model misspecification persists across scenarios, which is consistent with the Bernstein-von Mises theorem (van der Vaart; 2000, Chapter 10.2).

A real-data illustration
To demonstrate its practical use, we apply the proposed method to a probability sample from the 2015 Current Population Survey (CPS) and a non-probability sample from the 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey. Note that the BRFSS survey itself is a probability sample; we manually discard its sampling weights to recast it as a non-probability sample for illustrating the proposed method.
To apply the proposed method, we use two-phase survey data with sample sizes n A = 1000 and n B = 8459. We focus on two outcome variables of interest: employment (percentages working and retired) and educational attainment (high school or less, abbreviated h.s.o.l., and college or above, abbreviated c.o.a.). Both datasets provide measurements on the outcomes of interest and some common covariates, including age, sex (female or not), race (white or black), origin (Hispanic or not), region (northeast, south, or west), and marital status (married or not). To illustrate the heterogeneity of the study populations, Table 4 contrasts the means of variables from the CPS sample (design-weighted averages) and the BRFSS sample (simple averages). Based on Table 4, the BRFSS sample may not be representative of the target population, so a pretesting procedure before pooling is warranted. Table 5 presents the results. For all estimators, we specify the propensity score model to be a logistic regression model with the covariates (all variables excluding the outcome variable) and the outcome mean model to be a logistic regression model with the covariates. The efficient estimator µ eff is the most efficient among all estimators, gaining efficiency over both µ A and µ bc ; however, it may be subject to biases if the non-probability sample does not satisfy the required assumptions. In the test-and-pool analysis, the pretest rejects the use of the non-probability sample for the employment variables "working" and "retired" but accepts its use for the education variables "high school or less" and "college or above". Thus, for the employment variables, µ tap = µ A , and for the educational attainment variables, µ tap gains efficiency over µ A .
The Bayesian estimators with the informative priors 2 and 3 are more efficient than the one with prior 1.
However, they still yield larger standard errors than the probability-only estimator µ A , perhaps because the non-probability-based informative priors are biased for the model parameters of the probability sample. From the test-and-pool analysis, in 2015 the employment rate and the retirement rate are 58.7% and 13.6%, respectively; the percentage of the U.S. population with a high school education or less is 39.0%, and the percentage with a college education or above is 31.7%.

Concluding remarks
When utilizing non-probability samples, researchers often assume that the observed covariates contain all the information needed to recover the sampling mechanism. However, this assumption may be violated, in which case the integration of the probability and non-probability samples is subject to biases. In this paper, we propose the test-and-pool estimator, which first scrutinizes, via hypothesis testing, the assumption required for combining, and then carefully combines the probability and non-probability samples by a data-driven procedure to achieve the minimum mean squared error. In the theoretical development, we treat (Λ, c γ ) jointly as two tuning parameters and establish the asymptotic distribution of the pretesting estimator without taking their uncertainties into account. The non-regularity of the pretest estimator invalidates conventional methods for generating reliable inferences. To address this issue, the proposed adaptive confidence interval is designed to handle the non-smoothness of the pretest estimator and ensure uniformly valid inferences. It is important to note, however, that this approach may yield only a small gain in the precision of the confidence interval, even though the point estimator can achieve a significant gain in MSE compared to the estimator based only on the probability sample. Further research is required to develop a valid post-testing confidence interval with reduced conservatism.
Pretest estimation is the norm rather than the exception in applied research, so the theories we have established are highly relevant to researchers engaged in applied work. Beaumont (2020) discussed the trade-off between the efficiency gain from invoking model assumptions and the risk that these assumptions do not hold. Thus, pretesting can be potentially useful for small-area estimation, which we will investigate in the future.

Acknowledgment
Yang's research is partially supported by NIH 1R01AG066883 and 1R01ES031651.

A Proofs
A.1 Regularity conditions
Let Φ A (V, δ A ; µ, τ ) and Φ B (V, δ A , δ B ; µ, τ ) be l-dimensional estimating functions for the parameter µ g ∈ R l when using the probability sample and the combined samples, respectively. Let Φ τ (V, δ A , δ B ; τ ) be the k-dimensional estimating equations for the nuisance parameter τ 0 ∈ R k . Then, we construct one stacked estimating equation system. For establishing our stochastic statements, we require the following regularity conditions. Assumption S3. The following regularity conditions hold.
a) The estimating function is integrable with respect to the joint distribution of (V, δ A , δ B ) for all θ in a neighborhood of θ 0 .
d) The first two partial derivatives of E{Φ(V, δ A , δ B ; θ)} and their empirical estimators are invertible for all θ in a neighborhood of θ 0 .
e) For all j, k, l ∈ {1, · · · , 2l + k}, there is an integrable function B(V, δ A , δ B ) such that
h) There exist C 1 and C 2 such that 0 < C 1 ≤ N π A,i /n A ≤ C 2 and 0 < C 1 ≤ N π B,i /n B ≤ C 2 for all i ∈ U.
Assumption S3 a)-e) are typical finite moment conditions to ensure the consistency of the solution to the estimating functions; see (1994, Appendix B), (Tsiatis; 2006, Section 3.2), (Boos and Stefanski; 2013, page 293) and (Vermeulen and Vansteelandt; 2015, Appendix C). Assumption S3 f) is required for obtaining the asymptotic normality of µ g under the superpopulation model.
Assumption S3 g) states that the sampling fraction is negligible, which is helpful for the subsequent variance estimation and allows us to use O(N −1/2 ) and O(n −1/2 ) interchangeably. Assumption S3 h) implies that the inclusion probabilities for Samples A and B are of the order n/N , which is necessary to establish their root-n consistency.
It is noteworthy that in Assumption 1, the asymptotic normality is ascertained for the design-weighted estimators given the finite population F N . Hereby, we extend the conditional normality to the unconditional one, which averages over all possible finite populations satisfying Assumption S3 f). The following lemma plays a key role in establishing the stochastic statements (Fuller; 2009, Theorem 1.3.6).
Lemma S1. Under Assumption 1 and Assumption S3 f), let {F N } be a sequence of finite populations and A N be a sample selected from the N th population by the PR design with size n N . We know that the design-weighted estimator µ g and the finite-population estimator µ g are both asymptotically normally distributed, where · ∼ denotes the asymptotic distribution. Then, µ g − µ g is also asymptotically normal.
By Lemma S1, the sampling fraction is negligible, and therefore the limiting variance of this term is 0, indicating that the intermediate step of producing the finite population is of little significance.

A.2 Proof of Lemmas 1 and 2
In the general case, we begin by investigating the statistical properties of the two estimators. First, we simplify our notation. By the Taylor expansion of Φ B,n ( µ B , τ ) at (µ g , τ 0 ), we have (S2) for some ( µ * B , τ * ) lying between ( µ B , τ ) and (µ g , τ 0 ). Also, under Assumption S3 a), b) and c), by the Taylor expansion, we have (S3) as τ → τ 0 . Also, under Assumption S3 e), we know (S4), where the first two probability convergences follow directly from the weak law of large numbers under Assumption S3 f) and the continuous mapping theorem, since µ g → µ g,0 and ( µ A , τ ) → (µ g,0 , τ 0 ) by design, and ( µ * A , τ * ) lies between ( µ A , τ ) and (µ g,0 , τ 0 ). As for the third and fourth probability convergences, we first prove that µ B,0 − µ g,0 = o np-p-ζ (1) under the local alternative, as stated in Lemma S2. Lemma S2. Under Assumptions 1, 2 (iii) and suitable moment conditions in Assumption S3, the stated convergence holds. Next, under Assumption S3 e), we have (S5), where A n ∼ = B n means that A n = B n + o ζ-p-np (1) and µ # B lies between µ * B and µ g,0 . Since µ B → µ B,0 , µ g → µ g,0 and µ * B lies between µ B and µ g , we establish the second approximation in (S5); the remaining parts can be established similarly, and hence we obtain the last two parts of (S4). By plugging (S3) and (S4) into (S2), we obtain the influence function for µ B , where ψ B (V i ; µ, τ ) is the influence function for the estimation of µ B under H 0 . For completeness, we define the influence function ψ A (V i ; µ, τ ) for the estimator µ A in an analogous way, where ϕ A,r (V ; µ, τ ) = ∂Φ A (V, δ A ; µ, τ )/∂τ . By Lemma S1, the joint asymptotic distribution of n 1/2 ( µ A − µ g ) and n 1/2 ( µ B − µ g ) is normal, where V A , Γ and V B are the total (co-)variances of the two-phase design averaging over the finite populations; the first term is attributed to the randomness of the probability (and non-probability) sample designs, and the second term is attributed to the randomness of the superpopulation model. The rest of the proof is summarized in Lemma S3.
Lemma S3. Under Assumption S3 and the asymptotic joint distribution for µ A and µ B in Lemma 2, the form of µ eff that maximizes the variance reduction under H 0 is given by the weighted combination with weight functions ω A and ω B , where Φ A,B,n (Λ, µ g,0 , τ 0 ) = Φ A (V i , δ A,i ; µ g,0 , τ 0 ) + ΛΦ B (V i , δ A,i , δ B,i ; µ g,0 , τ 0 ). The most efficient estimator µ eff has the stated asymptotic distribution under H a,n . When µ A and µ B are both scalar, V eff reduces to a scalar form.
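In the scalar case, the weights ω A and ω B implement the classical minimum-variance combination of two correlated, asymptotically unbiased estimators. The sketch below is that generic textbook formula, not a quotation of the paper's exact expression for V eff .

```python
import numpy as np

def optimal_combination(mu_A, mu_B, V_A, V_B, Gamma):
    """Minimum-variance linear combination w*mu_A + (1-w)*mu_B of two
    (asymptotically) unbiased scalar estimators with variances V_A, V_B
    and covariance Gamma. Standard result for combining estimators."""
    denom = V_A + V_B - 2 * Gamma
    w = (V_B - Gamma) / denom            # weight on the probability-sample estimator
    est = w * mu_A + (1 - w) * mu_B
    var = (V_A * V_B - Gamma**2) / denom  # resulting variance, <= min(V_A, V_B)
    return est, var
```

For example, with V_A = 2, V_B = 1 and Γ = 0.5, the combined variance is 0.875, smaller than either input variance, illustrating the efficiency gain that pooling provides under H 0 .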

A.3 Proof of Lemma 3
By applying the Taylor expansion with the Lagrange form of the remainder, the asymptotic distribution for n 1/2 (12) can be shown in the neighborhood of (µ g,0 , τ 0 ) ⊺ , as plim µ A = µ g,0 and plim τ = τ 0 . Under Assumption S3 e), we have the following. Next, by replacing the first two terms in Equation (S13) with Equation (S2), and applying the weak law of large numbers under Assumptions 1, 2 (iii) and Assumption S3 together with the joint distribution, we obtain the asymptotic distribution of Φ B,n (V i , δ A,i , δ B,i ; µ A , τ ).

A.4 Proof of Theorem 1
From Lemmas 1 and 2, we know the asymptotic joint distribution for µ A and µ B . For simplicity, we let n 1/2 ( µ A − µ g ) and n 1/2 ( µ B − µ g ) be asymptotically distributed as Z 1 and Z 2 , respectively. Next, we attempt to find another linear combination of Z 1 and Z 2 which is orthogonal to U 2 . Observe that, with this choice, it is easy to verify that the covariance of U 1 and U 2 is zero under H 0 .
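The orthogonalization step can be illustrated numerically. Assuming, as the construction suggests, that U 2 is proportional to the contrast Z 2 − Z 1 , the coefficient c making U 1 = Z 1 + cU 2 uncorrelated with U 2 is determined directly by the joint covariance:

```python
import numpy as np

def orthogonalize(Sigma):
    """Given Cov((Z1, Z2)) = Sigma (2x2), set U2 = Z2 - Z1 and choose c so
    that U1 = Z1 + c*U2 satisfies Cov(U1, U2) = 0 (assumed construction)."""
    v1, g, v2 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    var_U2 = v1 + v2 - 2 * g        # Var(Z2 - Z1)
    cov_Z1_U2 = g - v1              # Cov(Z1, Z2 - Z1)
    c = -cov_Z1_U2 / var_U2
    # Check: Cov(U1, U2) = Cov(Z1, U2) + c * Var(U2) = 0 by construction.
    return c, cov_Z1_U2 + c * var_U2
```

For zero-mean jointly Gaussian (Z 1 , Z 2 ), zero covariance is exactly independence, which is the property exploited in the next step of the proof.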
Also, since U 1 and U 2 are both asymptotically normal, zero covariance implies independence. After standardization, we obtain W 1 and W 2 as in (S15). Here we use −Σ −1/2 T to standardize U 2 for later convenience. Therefore, the stated limiting behavior holds under the local alternative H a,n . Combining the above, and since W 1 ⊥ W 2 , we can project the TAP estimator µ tap with the optimal tuning parameter (Λ * , c γ * ) onto these two bases separately. First, consider the condition T = W ⊺ 2 W 2 > c γ * . Next, on the condition T = W ⊺ 2 W 2 ≤ c γ * , we have the truncated representation, where W t 2 = W 2 | W ⊺ 2 W 2 ≤ c γ , and ω * A , ω * B are the newly tuned weight functions defined in (S9) and (S10) with Λ = Λ * . In this way, we can fully characterize the asymptotic distribution of the TAP estimator µ tap under the optimal tuning parameter.

A.5 Proof of the bias and mean squared error of the test-and-pool estimator n 1/2 ( µ tap − µ g )
For the general case, given W 2 ∼ N p (µ 2 , I p×p ), the MGF of the truncated normal distribution W 2 | a ≤ W ⊺ 2 W 2 can be derived, in which F p (a; µ ⊺ 2 µ 2 /2), the CDF of the noncentral chi-square distribution with noncentrality parameter µ ⊺ 2 µ 2 /2 evaluated at a, appears in the normalization constant. To compute the first and second moments of this truncated normal distribution, we take derivatives of the MGF and evaluate them at t = 0. By the nature of the MGF, we obtain the first moment of W 2 ; taking the second derivative of the MGF then yields the second moment. In our case, the MSE can be derived from the known decomposition of mean squared error into squared bias plus variance, and the corresponding bias and MSE follow. Overall, the bias and mean squared error for n 1/2 ( µ tap − µ g ) can be characterized as bias(λ, c γ ; η) = ξ · bias(λ, c γ ; η) T ≤cγ + (1 − ξ) · bias(λ, c γ ; η) T >cγ , mse(λ, c γ ; η) = ξ · mse(λ, c γ ; η) T ≤cγ + (1 − ξ) · mse(λ, c γ ; η) T >cγ .
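The first-moment calculation can be checked numerically against a known identity for the normal distribution truncated to a ball: for W ∼ N p (µ, I p ), E[W 1{W ⊺ W ≤ a}] = µ F p+2 (a; λ), where F is the noncentral chi-square CDF with noncentrality λ = µ ⊺ µ in scipy's convention (the text's F p (a; µ ⊺ 2 µ 2 /2) uses the λ/2 convention). A Monte Carlo sketch:

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(1)
p, a = 2, 3.84
mu = np.array([0.8, -0.3])
lam = float(mu @ mu)            # noncentrality in scipy's convention

# Monte Carlo estimate of E[W 1{W'W <= a}] for W ~ N_p(mu, I_p).
W = mu + rng.standard_normal((400_000, p))
inside = (W ** 2).sum(axis=1) <= a
mc = W[inside].sum(axis=0) / len(W)

# Identity: E[W 1{W'W <= a}] = mu * F_{p+2}(a; lam).
theory = mu * ncx2.cdf(a, df=p + 2, nc=lam)
```

The dimension bump from p to p + 2 in the CDF is the same mechanism that produces the noncentral chi-square CDF terms in the bias and MSE formulas above.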

A.6 Proof of the asymptotic distribution for U (a)
Throughout the proof, we assume that the regularity conditions in Lemma 1 and the assumptions in Theorem 2 hold. We prove that the coverage probability of the adaptive projection sets is guaranteed to be at least 1 − α, where C BACI µg,1−α (a) = a T µ tap − U 1−α/2 (a)/ √ n, a T µ tap − L α/2 (a)/ √ n . It suffices to show that U (a) obtained by bootstrapping converges to the same asymptotic distribution as U (a). Let D p×p denote the space of p × p symmetric positive-definite matrices equipped with the spectral norm. We can rewrite U (a) accordingly. Next, we adopt the bootstrap notation to express the upper bound U (a) = U (b) (a). We then define some functions to proceed with the proof, w 11 : , where G A = n 1/2 ( µ A − µ g ) and G B = n 1/2 ( µ B − µ g ). Using these functions, we can re-express the upper bound U (a). Under the conditions in Theorem 2, we can show that 1. w 11 is continuous at points in (Σ T , Σ S , R l , R l , R d , µ 2 ) and w 12 is continuous at points in B , µ 2 ); 2. the remainder terms converge to zero with probability one as n → ∞, uniformly in µ 2 . See Lemmas B.9 and B.11 in Laber et al. (2014) for details.
Combining (S16) and (S17), U (a) is guaranteed to be continuous.

A.7 Proof of Theorem 2
Based on the established consistency of the bootstrap bounds in Section A.6, the proof can be decomposed into two parts, where G U (·) is the cumulative distribution function of U (a). Let G U (·) be the empirical cumulative distribution function of U (a) estimated by bootstrapping. Similarly, we can show the other part of the proof, where G L (·) is the cumulative distribution function of L(a). Combining the results above, we obtain the desired coverage statement. Thus, the proof is completed.
A.9 Proof of Lemma S1
Following similar arguments in Schenker and Welsh (1988), let F (·) and G(·) be the cumulative distribution functions (c.d.f.) of N (µ g , V 1 ) and N (−µ g,0 , V 2 ), respectively. Let Φ(·) be the convolution of G(·) and F (·), Φ(·) = (G ∗ F )(·); then we have the representation with s = t + µ g = t − (−µ g ). By Lemma 3.2 in Rao (1962), the first term converges to 0 uniformly in x. Since F (·) and G(·) are both bounded and continuous, by the dominated convergence theorem, the second term also converges to 0 (Schenker and Welsh; 1988, Lemma 1). Hence, the asymptotic c.d.f. of µ g − µ g is Φ(·), and the result follows as the convolution of Gaussians is still Gaussian (Abramowitz et al.; 1988; Boas).
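The closing step, that the convolution of Gaussians is Gaussian, can be checked directly: if A ∼ N(m 1 , V 1 ) and B ∼ N(m 2 , V 2 ) are independent, then A + B ∼ N(m 1 + m 2 , V 1 + V 2 ). A quick Monte Carlo sketch:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
m1, v1, m2, v2 = 0.3, 2.0, -0.1, 0.5
n = 400_000

# Sum of independent Gaussians: its empirical CDF should match N(m1+m2, v1+v2).
s = rng.normal(m1, np.sqrt(v1), n) + rng.normal(m2, np.sqrt(v2), n)
emp = float(np.mean(s <= 1.0))
theory = NormalDist(m1 + m2, float(np.sqrt(v1 + v2))).cdf(1.0)
```

The means and variances here are arbitrary illustrative values; the agreement of `emp` and `theory` at any evaluation point reflects the convolution property used in the lemma.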

A.10 Proof of Lemma S2
Under Assumptions 1, 2 (iii) and Assumption S3 f), we have the expansion for some µ * B between µ B,0 and µ g,0 . For (S19), the first approximation E np-p (· | F N ) is based on the design consistency and the non-probability-sample weak law of large numbers under Assumption 2 (iii), and the second approximation E ζ (·) is justified under Assumption S3 f). Expression (S20) follows from the continuous mapping theorem, since µ g = µ g,0 + O ζ (N −1/2 ) under Assumption S3 f). By rearranging the terms under the local alternative, the result follows.

A.11 Proof of Lemma S3
First, we show that the composite estimator µ pool is essentially the solution to a pooled estimating equation. Next, under Assumption S3 a)-d), we apply the Taylor expansion at the point (µ g , τ 0 ), which leads to the expansion for some ( µ * pool , τ * ) between ( µ pool , τ ) and (µ g , τ 0 ). Given the asymptotic joint distribution for µ A and µ B in Lemma 2, we obtain the representation for some intermediate value µ * pool between plim µ pool and µ g,0 , where Equation (S21) is obtained by using Equations (S7) and (S2) collectively. By Assumptions 1, 2 (iii) and suitable moment conditions in Assumption S3, under the local alternative, n 1/2 ( µ pool − µ g ) follows a normal distribution with mean and variance obtained by arguments similar to those in (S5). Plugging (S11) into Equation (S21), the asymptotic distribution of the most efficient estimator µ eff follows. It yields an efficient estimator similar to that derived in Yang and Ding (2020), where it is easy to show that ω A + ω B = I l×l , so that the asymptotic variance V eff of this efficient estimator follows. The expression of V eff can be complicated when the dimension of the parameters of interest is greater than 1. Here, we provide the form of V eff when the estimating equations are (4) and (5), which is guaranteed to be non-negative definite, i.e., a non-negative quantity. By the Cauchy-Schwarz inequality, we have √(V A V B ) ≥ Γ, and the claimed bound follows, where the two sides are equal if and only if V A = V B = Γ. The asymptotic variance of the efficient estimator for other multi-dimensional estimating equations can be obtained in an analogous way, but with much heavier notation.

B.1 A detailed illustration of simulation
Here, we provide a detailed illustration of estimating the finite-population parameter µ y = µ g . First, we note the required expectations. To obtain the asymptotic joint distribution of µ A and µ B , we construct the stacked estimating equation system, where we use µ A and µ B to distinguish between the estimators yielded by Φ A (V, δ A ; µ A ) and Φ B (V, δ A , δ B ; µ B , τ ).
By positing a logistic regression model π B (X i ; α) = exp(X T i α)/{1 + exp(X T i α)} and a linear model m(X i ; β), where τ = (α, β) and π A is known from the probability sampling design, there are various ways to construct the estimating functions Φ τ (V i ; α, β) for (α, β). One standard approach is to use the pseudo maximum likelihood estimator α and the ordinary least squares estimator β (Scharfstein et al.; Haziza and Rao). Usually, the maximum likelihood estimator of α is computed by maximizing the log-likelihood function. Since we do not observe X i for all units in the finite population, we instead construct the pseudo log-likelihood function l * (α), where the second equality is derived under the logistic regression model for π B (X i ; α). By taking the derivative of l * (α) with respect to α, the estimating functions for (α, β) can be constructed. Moreover, both Sample A and Sample B provide information on X and Y , so we can also consider the estimating equation based on the combined samples for β. In addition, Kim and Haziza (2014) propose another set of estimating functions, in which ( α, β) are obtained by jointly solving the corresponding estimating equations. Under Assumption S3 a)-e), we can apply the Taylor expansion around θ y = (µ y , µ y , τ T 0 ) T and obtain the expansion for some θ * = ( µ * A , µ * B , τ * T ) T lying between θ and θ y . Under Assumption 1, the consistency of µ A for µ y can be established, i.e., µ A = µ y + O p (n −1/2 ). Moreover, under Assumption S3 f), we have µ y = µ 0 + O ζ (N −1/2 ) and hence plim µ * A = µ 0 , i.e., µ * A converges to µ 0 in probability. Under Assumption S3 b), µ B is consistent for µ B,0 , and µ B,0 = µ 0 + O ζ-p-np (n −1/2 ) under the local alternative.
Denote θ 0 = (µ 0 , µ 0 , τ T 0 ) T . The following uniform convergence can be established under Assumption S3 (a)-(c) and (e), and by Assumption S3 (d), we have (S29). Rearranging the terms of (S29), we then have the linearization, where ϕ(θ) = ∂Φ(V, δ A , δ B ; θ)/∂θ T . For simplicity of notation, we denote π B (X i ; α) = π B,i , m(X i ; β) = m i and ṁ i = ∂m(X i ; β)/∂β; the expectation is given by (S30), where Ω = 0 if Φ τ (V, δ A , δ B ; α, β) is constructed from (S24) and (S25), and Ω = 1 if Φ τ (V, δ A , δ B ; α, β) is constructed from (S24) and (S26); π * B,i = pr(δ B,i = 1 | X i ) is the true probability. In addition, if (S27) and (S28) are used to estimate τ , we obtain (S31). Below, we focus on the asymptotic properties of n 1/2 ( θ − θ y ) under (S30); the asymptotics under (S31) can be obtained in an analogous way. First, we compute the inverse of E{ϕ(θ)}. As shown in Chen et al. (2019), under Assumption S3 g), the asymptotic variance of µ B is not affected by the estimated β. Let π B,i,0 = π B (X i ; α 0 ) and m i,0 = m(X i ; β 0 ) be the correct working models evaluated at the true parameter values (α 0 , β 0 ). The variance can then be found by using a decomposition: since the probability sample is assumed to be independent of the non-probability sample (Chen et al.; 2019), we can express the variance of Σ N i=1 Φ(V i , δ A,i , δ B,i ; θ y ) as two components V 1 and V 2 under Assumptions 1 and 2 (iii). By the law of total variance, the second term is negligible under Assumption S3 g) and h). By similar arguments, (S32) and (S33) follow. The sub-matrices D kl , k, l = 1, · · · , 3, are all design-based variance-covariance matrices under the probability sampling design, and can be obtained using a standard plug-in approach.
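The pseudo maximum likelihood step for α described above can be sketched as a Newton solver for the score Σ i∈B x i − Σ i∈A d i π B (x i ; α)x i = 0, with d i = 1/π A,i the design weights. This is the assumed Chen-Li-Wu-type form of the estimating equation; the exact display (S24) is not quoted here.

```python
import numpy as np

def pseudo_mle_alpha(XB, XA, dA, iters=30):
    """Newton-Raphson for the pseudo log-likelihood
       l*(alpha) = sum_B x'alpha - sum_A d_i log(1 + exp(x'alpha)),
    whose score is sum_B x_i - sum_A d_i * pi_B(x_i; alpha) * x_i.
    XB: covariates of the non-probability sample; XA, dA: covariates and
    design weights of the probability sample. (Assumed form of (S24).)"""
    alpha = np.zeros(XA.shape[1])
    for _ in range(iters):
        eta = np.clip(XA @ alpha, -30, 30)   # guard against overflow
        p = 1.0 / (1.0 + np.exp(-eta))
        score = XB.sum(axis=0) - XA.T @ (dA * p)
        hess = -(XA * (dA * p * (1.0 - p))[:, None]).T @ XA  # l* is concave
        alpha = alpha - np.linalg.solve(hess, score)
    return alpha
```

Because l*(α) is concave, the Newton iteration converges to the unique pseudo-MLE under mild conditions on the design points.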
Algorithm B.1. Replication-based method for estimating the variances of µ A and µ B . Input: the probability sample {(V i , δ A,i ) : i ∈ A}, the non-probability sample {(V i , δ B,i ) : i ∈ B}, and the number of bootstrap replicates K.
for b = 1, · · · , K do Sample n A units from the probability sample with replacement as A (b) .
Sample n B units from the non-probability sample with replacement as B (b) .
Compute the bootstrap replicates µ (b) A and µ (b) B . Alternatively, a with-replacement bootstrap variance estimation can also be used here. To illustrate, we consider a single-stage probability proportional to size sampling design with negligible sampling ratios. Following Shao and Tu (2012), the bootstrap procedures in Algorithm B.1 are conducted.
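A sketch of Algorithm B.1 with a generic plug-in estimator: each sample is resampled with replacement independently, the estimator is recomputed, and the empirical variance of the replicates is returned. The function and argument names are illustrative.

```python
import numpy as np

def bootstrap_variance(estimator, sample_A, sample_B, K=200, seed=0):
    """With-replacement bootstrap (Algorithm B.1 sketch): resample each
    sample independently, recompute the estimator on the replicates, and
    take the empirical variance. `estimator` maps (A, B) arrays to a scalar."""
    rng = np.random.default_rng(seed)
    reps = []
    for _ in range(K):
        Ab = sample_A[rng.integers(0, len(sample_A), len(sample_A))]
        Bb = sample_B[rng.integers(0, len(sample_B), len(sample_B))]
        reps.append(estimator(Ab, Bb))
    return np.var(reps, ddof=1)
```

For instance, applying it with the A-sample mean as the estimator recovers approximately s²/n A , the usual variance of a sample mean, when the design weights are equal.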
Under Assumptions 1 and S3, θ − θ y | F N and θ y are both approximately normal, which leads to the asymptotic normality of the unconditional distribution over all finite populations by Lemma S1. Thus, the asymptotic variance of the joint distribution n 1/2 ( µ A − µ y , µ B − µ y ) T is obtained from the corresponding 2 × 2 submatrix, where E{∂Φ B (µ 0 , τ 0 )/∂µ} = −1.
a) In the leftmost plot, where H 0 holds, for a given Λ, the mean squared error decreases drastically and then flattens out as c γ increases. Moreover, for a given c γ , there exists a minimizer Λ * at which the mean squared error achieves its minimum. These observations justify our strategy of viewing Λ and c γ jointly as tuning parameters, since both play important roles in searching for the minimum mean squared error.
b) In the middle plot, where H 0 is weakly violated, the pattern of the mean squared error in c γ retains features similar to those in a). In addition, the optimal choice Λ * leads to a sharp decline of the mean squared error compared to other choices of Λ. These findings imply that, despite the bias from accepting the non-probability sample, its impact is smaller than the increased variance from rejecting the non-probability sample. But care is needed in determining the amount of information borrowed from the non-probability sample, since a small deviation from the optimal value Λ * can lead to a non-negligible increase of the mean squared error. Once the optimal mean squared error is reached at (Λ * , c * γ ), further increases of c γ are not influential.
c) In the rightmost plot, where H 0 is strongly violated, the mean squared error behaves differently from a) and b). It is advisable to choose both Λ and c γ close to zero (i.e., a low probability of combining the non-probability sample with the probability sample) to minimize the mean squared error. As above, continuing to increase c γ after the mean squared error flattens out has no effect. Each column of the plots corresponds to a different metric: "bias" for bias, "var" for variance, "MSE" for mean squared error.

B.3 Additional simulation results
Figure B.2 presents the plots of Monte Carlo biases, variances and mean squared errors of µ A , µ bc , µ eff , µ tap and µ tap:fix based on 2000 replicated datasets. For the fixed-threshold strategy µ tap:fix , the threshold c γ is held fixed at the 95th quantile of a χ 2 1 distribution (i.e., 3.84), and the tuning parameter Λ is selected by minimizing the asymptotic mean squared error at the fixed c γ .
In Table B.1, we find that the adaptive procedure tends to select smaller values of Λ and c γ as b increases. As a result, the Monte Carlo proportions of combining the probability and non-probability samples together are decreasing, which is desired for down-weighting the biased non-probability sample. Moreover, we compare the adaptive tuning strategy of c γ with a fixed thresholding strategy, and Figure B.2 shows that the strategy with pre-defined cutoff cannot satisfactorily control the mean squared error when H 0 is slightly or strongly violated.

B.4 Double-bootstrap procedure for v n selection
Following the algorithm of Chakraborty et al. (2013), in which the optimal v n is selected to ensure the coverage probability, we retain the K bootstrapped samples V (b) , b = 1, · · · , K, each of size n = n A + n B . The procedure is called a double bootstrap because each bootstrap sample spawns its own set of K ′ second-order bootstrap samples. Next, we set up the candidates for v n . Under assumption (A2), we let v n take the form κ log log n with κ ∈ {2, 4, 10, 20, 30}, and construct the bound-based adaptive confidence intervals for each given κ at the 1 − α confidence level, denoted by C PACI,κ µg,1−α (a). Given each κ, we compute the coverage probability of the associated adaptive confidence intervals over the K ′ second-order bootstrap datasets. Then, we choose the smallest κ that ensures an actual coverage probability larger than 1 − α. Specifically, we use the estimator µ (b) A for µ A in each bootstrapped dataset as the ground truth and count the number c(κ) of datasets in which the adaptive confidence interval covers the ground truth, i.e., µ (b) A ∈ C PACI,κ,(b) µg,1−α (a); the tuning parameter is then determined as v n = inf{κ : c(κ)/K ′ > 1 − α} × log log n. In our simulation, K ′ is set to 100.
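Given second-order bootstrap coverage estimates for each candidate κ, the final selection rule reduces to the following sketch (the coverage numbers in the example are hypothetical):

```python
import numpy as np

def select_vn(coverage_by_kappa, n, alpha=0.05, kappas=(2, 4, 10, 20, 30)):
    """Return v_n = kappa* x log(log n), where kappa* is the smallest
    candidate whose estimated double-bootstrap coverage exceeds 1 - alpha
    (the rule in B.4). Falls back to the largest candidate if none qualifies."""
    for k in kappas:
        if coverage_by_kappa[k] > 1 - alpha:
            return k * np.log(np.log(n))
    return kappas[-1] * np.log(np.log(n))
```

With n = n A + n B ≈ 5600 as in the simulation, log log n ≈ 2.15, so the candidate grid spans roughly 4.3 to 64.6 for v n .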

B.5 Details of the Bayesian method
In this section, we provide the details of the Bayesian approaches proposed by Sakshaug et al. (2019) to combine the probability and non-probability samples as follows.
1. Solve the score function for β using the non-probability sample.
2. Construct the informative prior, with three choices:
Prior 1: Choose a weakly informative parameterization of the prior as β ∼ N (0, 10 6 ), which can be treated as a reference for comparison.
Prior 2: Let β PR be the solution to the score function based on the probability sample. Then take the squared Euclidean distance between β PR and β NPR as the hyperparameter σ 2 β for the variance of β.
Prior 3: In lieu of using the squared distance to extract information on σ 2 β , a nonparametric with-replacement bootstrap procedure can be implemented (B = 1000). After estimating the coefficient in each bootstrap sample, denoted by β (i) NPR , a replication-based variance estimator σ 2 β NPR can be obtained. Then, the informative prior can be constructed as β ∼ N ( β NPR , I p×p · σ 2 β NPR ).
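Priors 2 and 3 differ only in how the common prior variance is formed. Under the stated description (prior centred at β NPR with a scalar variance σ 2 β ), they can be sketched as follows; averaging the componentwise bootstrap variances into a single scalar in Prior 3 is an assumption made here for illustration.

```python
import numpy as np

def prior_2(beta_pr, beta_npr):
    """Prior 2: centre at beta_NPR; prior variance is the squared Euclidean
    distance between the probability- and non-probability-based estimates."""
    sigma2 = float(np.sum((beta_pr - beta_npr) ** 2))
    return beta_npr, sigma2

def prior_3(beta_npr, beta_npr_boot):
    """Prior 3: centre at beta_NPR; prior variance from a replication-based
    variance estimator over B bootstrap replicates (componentwise variances
    averaged into one scalar -- an illustrative choice)."""
    boot = np.asarray(beta_npr_boot)
    sigma2 = float(np.mean(np.var(boot, axis=0, ddof=1)))
    return beta_npr, sigma2
```

Either pair (centre, variance) then parameterizes the normal prior β ∼ N(centre, I · variance) used in the Bayesian pooled analysis.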