Bias Correction in Clustered Underreported Data

The poor quality of data from socially deprived regions has given rise to many statistical challenges. One of them is the underreporting of vital events, which leads to biased estimates for the associated risks. To deal with underreported count data, models based on compound Poisson distributions have commonly been assumed. To be identifiable, such models usually require extra and strong information about the probability of reporting the event in all areas of interest, which is not always available. We introduce a novel approach for the compound Poisson model assuming that the areas are clustered according to their data quality. We leverage these clusters to create a hierarchical structure in which the reporting probabilities decrease as we move from the best group to the worst ones. We obtain constraints for model identifiability and prove that only prior information about the reporting probability in areas experiencing the best data quality is required. Several approaches to model the uncertainty about the reporting probabilities are presented, including reference priors. Different features of the proposed methodology are studied through simulation. We apply our model to map the early neonatal mortality risks in Minas Gerais, a Brazilian state with heterogeneous characteristics and marked socioeconomic inequality.


Introduction
The estimation of economic, health and social indicators in underdeveloped and developing countries has been a challenging task due to the low quality of the available data. In such areas, even with the recent advances, data coming from official collection systems usually experience considerable underreporting of events. To cite an example, deaths of infants who die shortly after birth often go unreported. If not accounted for, such phenomena typically lead to the underestimation of vital statistics, compromising the definition of appropriate government intervention policies and the distribution of financial resources.
In the statistical literature, the bias problem induced by a defective data reporting process is commonly handled by considering hierarchical models that accommodate truncated or censored observations. For mapping the risks associated with count events subject to underreporting, Bailey et al. (2005) consider the censored Poisson regression model proposed by Caudill and Mixon Jr. (1995), assuming that, for suspected areas, the observed count represents a right-censoring threshold for the true non-observed total number of events. This approach relies on the fact that, a priori, all areas experiencing underreporting are precisely known. Bailey et al. (2005) consider ad-hoc procedures to determine the censored (underreported) areas. Later, Oliveira, Loschi, and Assunção (2017) define a random-censoring Poisson model (RCPM), introducing more flexibility in the analysis of underreported count data. The RCPM allows for the estimation of both the associated occurrence rates and the probability of each area experiencing censoring. The authors showed that the quality of the posterior estimates is related to the availability of informative prior distributions for the censoring probabilities. The compound Poisson model (CPM) is an alternative approach to deal with potentially underreported counts. It allows for the joint modeling of the event occurrence rates and the associated reporting probabilities. The main difference between the RCPM and the CPM is that the former models the underreporting status of each area: is area i suffering from underreporting or not? In turn, the latter models the area-specific probability of each particular event being reported, so all areas are, in principle, subject to underreporting.
To guarantee the CPM identifiability, it is necessary to introduce prior information on the reporting process. This has been carried out in different ways in the literature depending on the context and the type of information available. For example, Whittemore and Gong (1991), Stamey, Young, and Boese (2006) and Dvorzak and Wagner (2015) resort to a validation dataset on the reporting process. This refers to another independent data source, free of underreporting, that can be used to calibrate the bias induced by the underreporting in the main dataset under analysis. Such an additional gold-standard dataset does not necessarily have to be on the same scale as the primary data, but it has to be available for each sample unit. Thus, validation datasets are rarely available and can be very expensive to obtain. All three previous papers use the same illustrative example, which is based on a single validation dataset of severely restrictive extent. Specifically, their validation dataset is based on a 1987 study that selected a sample of 203 physicians divided into four groups according to their nationality (England, Belgium, France, and Italy). In each group, the sample of physicians was asked to complete a specimen death certificate for the case history of a single 51-year-old woman with an ulcerating tumor of the cervix. The certificate had enough information to induce the correct classification of the patient as a victim of cervical cancer. However, the groups reached different proportions of death certificates correctly coded as cervical cancer. The result is then used as a gold-standard estimation of the correct diagnosis and completion of death certificates for this specific cancer as the underlying cause of death. Hence, this validation dataset is outdated and should be viewed cautiously if used for recent death data.
Furthermore, it is useful only for one single cancer (cervical cancer) in four specific countries, being hardly generalizable to other sorts of cancers or other regions. Moreno and Girón (1998) resort to a different strategy, as they did not have a validation dataset in their study of reported assaults in Málaga and Stockholm. They provide a detailed investigation of the CPM whenever conjugate families are considered to independently model the prior uncertainty about the reporting probabilities and the occurrence rates. The authors emphasize that prior information on the reporting probabilities must be included to make posterior estimation feasible. Such information can be obtained through specific surveys or from experts' opinion and then be conveniently used to set the hyperparameters of the conjugate prior distributions. Following Moreno and Girón (1998)'s approach, Schmertmann and Gonzaga (2018) consider the CPM to estimate age-specific mortality and life expectancy for small areas with defective vital records in Brazil. Probabilistic prior information on the death registration coverage in each area is considered to elicit an informative Beta prior distribution for the death reporting probability in three age groups. The authors derived such prior information from standard demographic estimation techniques, such as the Death Distribution Methods, and also from intensive field audits conducted by Brazilian public health researchers.
As an alternative to these previous models, Stoner, Economou, and Drummond (2019) present a Bayesian hierarchical CPM to account for the underreporting in tuberculosis counts in Brazil. To complement the partial information in the data, their model only requires an informative prior distribution for the mean reporting rate. To elicit such an informative prior across all Brazilian microregions, the authors consider external estimates of the overall tuberculosis detection rate derived by the World Health Organization through an inventory study.
Trustful prior information about the overall mean reporting process is not always available. Sometimes, one has only pieces of prior information on the reporting process for some subsets of areas, obtained through local inventory studies (local active search for cases) or experts' opinion. In many epidemiological studies, for example, one only knows a priori that the severity levels of underreporting are likely associated with some socioeconomic indicators or, merely, that less socially deprived areas properly record a greater percentage of their events, producing more reliable information (see Campos, Loschi, and França, 2007; Bailey et al., 2005; Silva et al., 2017, for instance). That is the case, for example, when mapping the infant mortality rates in underdeveloped regions, such as Africa and Latin America, based on data coming from defective death registration systems (World Health Organization, 2006; Alkema and New, 2014; Alexander and Alkema, 2018). Inspired by situations in which validation datasets are inaccessible and reliable prior information about the reporting process is only available for areas experiencing the best data quality, we propose a new hierarchical Bayesian approach for the CPM (Section 2). It considers that the areas composing the region of interest are ordered according to data quality categories. If it is reasonable to additionally cluster the areas into homogeneous groups, then the model becomes more parsimonious. The clusters may be defined based on experts' opinion or by applying some clustering technique to data quality indicators provided by previous studies and surveys. In our model, the data quality clustering of the areas is a tool used to model their reporting probabilities. We leverage the clusters to create a hierarchical structure in which the reporting probabilities decrease as we move from the best data quality areas to the worst ones.
The novelty in our approach is that only an informative prior distribution about the reporting probability at areas experiencing the best data quality is required to ensure identifiability (Section 2.1). To model the event occurrence rates, we consider a set of potential covariates through a regression structure. Bayesian variable selection is incorporated into the model to identify regressors with a non-zero effect.
Extensive simulation studies are presented to evaluate the performance of the proposed model in different scenarios (Section 3). We apply the developed Bayesian methodology to estimate the early neonatal mortality rates in Minas Gerais State, Brazil, for the periods 1999-2001 and 2009-2011, where the death counts are known to be underreported (Campos, Loschi, and França, 2007). In this context, the proposed approach is attractive because neither validation datasets nor prior knowledge about the overall mean reporting probability is available. Section 5 closes the paper with our main conclusions.

Model specification
Consider a region divided into A areas and denote by T_i the total number of events at area i, for i = 1, ..., A. Assume that T_i | λ_i ~ Poisson(λ_i), where λ_i is the mean expected count in the ith area. The relative risk for the event at area i is given by θ_i = λ_i / E_i, where E_i is a known offset quantity representing the expected number of events in such an area. The offset E_i allows for a variation in the population size over the areas. In the context of underreported data, T_i is not fully observed for, at least, part of the areas, so that the reported number of events Y_i corresponds only to a fraction of T_i. To account for this data feature, each event occurring in the ith area is associated with a binary random variable Z_t ind~ Bernoulli(ε_i), t = 1, ..., T_i, that determines whether the event is reported or not, where ε_i ∈ [0, 1] represents the associated reporting probability. The random variables in the sequence Z_1, Z_2, ..., Z_{T_i} are assumed to be identically distributed, mutually independent and also independent of T_i. The observed count is then Y_i = Σ_{t=1}^{T_i} Z_t, and the Poisson thinning property yields

Y_i | θ_i, ε_i ~ Poisson(E_i θ_i ε_i), i = 1, ..., A. (2.1)

The model in expression (2.1) is called the compound Poisson model (CPM). To model the relative risks, we assume that θ = (θ_1, ..., θ_A) is related to a set of p potential covariates such that log(θ_i) = β_0 + β_1 X_1i + ··· + β_p X_pi, i = 1, ..., A. Random effects may be included in the log-linear predictor to capture any residual spatial or local variation in the relative risk. The greatest challenge under the CPM is the modeling of the reporting probabilities ε = (ε_1, ..., ε_A). If no further information is available, only the parameter η_i = θ_i ε_i is identified from the observed data Y_i, since any parameter combination such that θ̃_i ε̃_i = θ_i ε_i yields the same likelihood function. Different approaches to model ε have been discussed in the literature. Moreno and Girón (1998) and Schmertmann and Gonzaga (2018) directly model the uncertainty about ε_i using informative beta prior distributions. A more general approach assumes that ε_i = g(H_1i, ..., H_mi), where H_1i, ..., H_mi is a set of m covariates related to the reporting process and g is any non-negative function such that 0 < g(H_1i, ..., H_mi) < 1 for all i. There are many possible choices for g. The most popular one is to assume that g is a logistic function, as in Whittemore and Gong (1991), Dvorzak and Wagner (2015) and Stoner, Economou, and Drummond (2019). As discussed in Section 1, all these approaches require either strong prior information about each ε_i or validation datasets to ensure model identifiability.
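The thinning result in (2.1) is easy to verify by simulation. The following sketch (all offsets, risks, and reporting probabilities are hypothetical illustrative values) generates true counts T_i, thins them event by event, and compares the empirical mean of Y_i with E_i θ_i ε_i:

```python
import numpy as np

rng = np.random.default_rng(42)

A = 5                                            # number of areas
E = np.array([100., 150., 80., 120., 200.])      # known offsets (hypothetical)
theta = np.array([1.2, 0.8, 1.5, 1.0, 0.9])      # relative risks (hypothetical)
eps = np.array([0.95, 0.90, 0.80, 0.70, 0.60])   # reporting probabilities

n_rep = 20000
y_mean = np.zeros(A)
for i in range(A):
    # True counts: T_i ~ Poisson(E_i * theta_i)
    T = rng.poisson(E[i] * theta[i], size=n_rep)
    # Given T_i, the reported count is a sum of i.i.d. Bernoulli(eps_i)
    # indicators, i.e. Y_i | T_i ~ Binomial(T_i, eps_i).
    Y = rng.binomial(T, eps[i])
    y_mean[i] = Y.mean()

# Marginally Y_i ~ Poisson(E_i * theta_i * eps_i), so the empirical
# means should match E_i * theta_i * eps_i.
print(np.round(y_mean, 1))
print(np.round(E * theta * eps, 1))
```

The binomial call exploits the fact that, conditional on T_i, the sum of the independent reporting indicators Z_t is Binomial(T_i, ε_i).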
One of the main goals in this work is to model ε in situations where no validation dataset is available to guarantee model identifiability and reliable prior information about the percentage of underreporting is only available for areas where data are known to be better reported. In this context, we assume that ε_i = 1 − γ − f(H_1i, ..., H_mi), where γ is the basal underreporting probability in the area with the best data quality and f is any non-negative function such that f(H_1i, ..., H_mi) < 1 − γ for all i. The function f captures the increase in the basal underreporting probability explained by the covariates. If f equals zero in the best area, then f(H_1l, ..., H_ml) denotes the increase in the underreporting probability for area l when compared to the best one. As in the model proposed by Stoner, Economou, and Drummond (2019), covariates H_1i, ..., H_mi are assumed to be different from X_1i, ..., X_pi to guarantee model identifiability. This model limitation may be avoided only if validation datasets are accessible, as in Dvorzak and Wagner (2015). A further discussion on this issue is given in Section 2.1.
The definition of a general f which satisfies all these constraints is not a simple task. To facilitate its construction, we assume that it is possible to sort the areas according to their data quality. Additionally, we assume that the reporting probabilities are equal for areas where the covariates related to the reporting process experience similar values. For this purpose, we assume that the A areas are grouped into K known data quality clusters, where K ≤ A. We allow for K = A to account for situations in which there is no prior information for clustering the areas. However, if such information is available and K < A, we obtain a more parsimonious model and more data information to estimate the reporting probabilities throughout the areas.
In practice, there are many ways to define the clusters. We may consider some grouping proposals available from previous works or to be guided by experts' information. Another possibility is to perform usual clustering techniques based on available covariates that are related to data quality in the region of interest.
Based on such a grouping structure, we use a convenient coding scheme to represent the clustering indicator variable, which is different from the variables in X_1i, ..., X_pi. Let h_i = (h_1i, ..., h_Ki)^T be the clustering variable composed of binary quantities h_1i, ..., h_Ki and defined according to the following split-coding scheme: if area i belongs to cluster j then h_li = 1 for all l ≤ j and h_li = 0 otherwise. Let γ = (γ_1, ..., γ_K), where γ_j ∈ [0, 1) for all j = 1, ..., K, such that Σ_{j=1}^{K} γ_j < 1. We assume that the reporting probability at area i is given by

ε_i = 1 − Σ_{j=1}^{K} γ_j h_ji. (2.2)

The constraint imposed on γ is necessary to guarantee that ε_i ≠ 0 for all i and, consequently, to ensure a non-null mean for the associated Poisson distribution.
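The split-coding scheme can be sketched as follows (cluster labels and γ values are hypothetical); each row of H accumulates ones up to the area's cluster, so the reporting probability decreases as data quality worsens:

```python
import numpy as np

# Hypothetical cluster labels for A = 6 areas and K = 3 data quality
# clusters (1 = best quality, K = worst).
cluster = np.array([1, 1, 2, 3, 2, 3])
K = 3

# Split coding: h_li = 1 for all l <= cluster(i), and 0 otherwise.
H = (np.arange(1, K + 1)[None, :] <= cluster[:, None]).astype(int)

gamma = np.array([0.05, 0.10, 0.15])   # increments, with sum < 1
eps = 1.0 - H @ gamma                  # reporting probabilities per area

print(H)
print(eps)   # best cluster: 0.95; second: 0.85; worst: 0.70
```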
The proposed model has some interesting features. Firstly, to be identifiable, it only requires information about the reporting probabilities in the best areas (see discussion in Section 2.1). Besides that, ε_i is represented in terms of interpretable parameters, which facilitates prior elicitation. For a given area i, h_i = (1, 0, ..., 0)^T and h_i = (1, 1, ..., 1)^T represent the two most extreme situations. If h_i = (1, 0, ..., 0)^T then the ith area has the highest level of data quality. We will assume that data in such an area are recorded with a higher probability (ε_i = 1 − γ_1) if compared to the areas in the remaining data quality levels. At the other extreme, if h_i = (1, 1, ..., 1)^T then the ith area lies in the worst data quality category. Data in this region are recorded with a lower probability (ε_i = 1 − γ_1 − ··· − γ_K) if compared to those areas belonging to clusters with better data quality. Thus, the parameter γ_1 represents the probability of not recording an event in areas classified in the highest level of data quality. The parameter γ_2 is the increment on such probability for areas experiencing the second highest data quality level, and so on. Another attractive feature of the proposed model is that, although the clustering indicator variable cannot be used to also model the relative risks θ, the covariates used for clustering are indirectly taken into consideration when estimating θ, since the areas belonging to the same cluster are homogeneous w.r.t. such clustering covariates.

On model identifiability
The lack of identifiability of the compound Poisson model presented in expression (2.1) has been discussed by several authors (Whittemore and Gong, 1991; Moreno and Girón, 1998; Stamey, Young, and Boese, 2006; Papadopoulos and Silva, 2012; Dvorzak and Wagner, 2015; Schmertmann and Gonzaga, 2018; Stoner, Economou, and Drummond, 2019). All these previous works impose some constraints on θ and ε to attain model identifiability.
A well-known way to overcome non-identifiability problems requires extra information about the reporting probabilities ε = (ε_1, ..., ε_A). In the most extreme cases, all components of the vector ε should be fixed at known quantities. Moreno and Girón (1998) and Schmertmann and Gonzaga (2018) show that this extreme constraint may be relaxed when the target of the statistical analysis is to estimate the relative risks. This is done by incorporating external estimates of registration coverage through very informative prior distributions on each component of ε.
To the best of our knowledge, there are two approaches to obtain an identifiable model when sets of covariates, say X and H, are used to model the relative risk θ and the reporting probability ε, respectively. The first one requires extra information from independent validation datasets (Whittemore and Gong, 1991; Stamey, Young, and Boese, 2006; Dvorzak and Wagner, 2015). This is a rare situation in practice that, however, does not require the intersection of X and H to be empty. The second one, adopted by Papadopoulos and Silva (2012) and Stoner, Economou, and Drummond (2019), creates some kind of linear separability of the covariate sets X and H. Stoner, Economou, and Drummond (2019) build X and H by splitting the set of all available covariates into two disjoint sets based on experts' opinion. Hence, there is an empty intersection between the covariates in the sets X and H, but this is not enough to guarantee identifiability. In their modeling framework, they also had available an informative prior distribution for the overall mean reporting rate, which was sufficient to complete the identifiability conditions. Papadopoulos and Silva (2012) allow intersection between the two sets of covariates but impose prior information to establish appropriate constraints on the parametric space, such as restrictions on the signs or exclusion of some coefficients. This avoids the need for validation datasets.
Our approach also assumes, as in Stoner, Economou, and Drummond (2019), that the clustering covariates associated with ε are not considered in the log-linear predictor of the relative risks θ. In principle, this constraint seems to be quite restrictive. Nevertheless, for model identifiability, what is required is the lack of strict mathematical collinearity between X and H, not their statistical independence. Thus, the two disjoint sets X and H may be correlated. In many practical situations, we can and probably will have the two sets composed of covariates carrying similar information, measuring related aspects of the areas. For instance, to estimate infant mortality rates, one expects that poor socioeconomic conditions will affect both the relative risks and the reporting probabilities. It is true that to avoid the identifiability issues we must not use the same covariates when modeling θ and ε. However, we are allowed to use correlated variables, since our identifiability assumption requires just the strictly empty intersection between the two sets, not the orthogonality of the information they carry. This makes our model much more attractive for practical implementation than the previously proposed alternatives.
If the number of clusters K is smaller than the initial number of areas A, the clustering strategy proposed in expression (2.2) imposes a reduction in the parametric space related to the CPM in expression (2.1). Even under such a reduction and assuming that X and H are disjoint sets, the proposed CPM remains unidentifiable. Its identification will depend on the only trustful prior information we have available: the percentage of data reporting in areas with the best data quality. Nevertheless, if such a piece of information is not available, other constraints for model identification are possible as discussed in the following.
Assume log(θ_i) = β_0 + β_1 X_1i + ··· + β_p X_pi, i = 1, ..., A, and denote by A_j the subset of areas belonging to the j-th data quality cluster, for j = 1, ..., K. Under these assumptions, the log-likelihood function associated with the proposed model is

l(Ψ) = Σ_{i=1}^{A} [y_i log(E_i θ_i ε_i) − E_i θ_i ε_i − log(y_i!)], (2.3)

where Ψ = (β_0, β_1, ..., β_p, γ_1, ..., γ_K). As the proposed model belongs to the exponential family, we obtain that

T(y) = (Σ_{i=1}^{A} y_i, Σ_{i=1}^{A} X_1i y_i, ..., Σ_{i=1}^{A} X_pi y_i, Σ_{i∈A_1} y_i, ..., Σ_{i∈A_K} y_i)

is the (p + K + 1)-dimensional sufficient statistic for the parameter vector Ψ. Note that the first coordinate of the vector T(y) is a linear combination of the last K coordinates, since the clusters A_1, ..., A_K partition the set of areas. Thus, the number of unknown parameters exceeds by one the number of linearly independent pieces of sample information (sufficient statistics). This implies that only p + K parameters can be estimated without additional information (McHugh, 1956; Picci, 1977; Huang, 2005).
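The linear dependence among the coordinates of the sufficient statistic is straightforward to check numerically. In the sketch below (arbitrary simulated data, hypothetical dimensions), the total count equals the sum of the K per-cluster totals because the clusters partition the areas:

```python
import numpy as np

rng = np.random.default_rng(0)
A, p, K = 12, 2, 3
cluster = rng.integers(1, K + 1, size=A)   # each area's quality cluster
X = rng.normal(size=(A, p))                # covariates for theta
y = rng.poisson(10.0, size=A)              # observed counts

# Coordinates of the sufficient statistic: total count,
# covariate-weighted sums, and per-cluster count totals.
T_total = y.sum()
T_cov = X.T @ y
T_clusters = np.array([y[cluster == j].sum() for j in range(1, K + 1)])

# The first coordinate is exactly the sum of the last K coordinates,
# so only p + K parameters are estimable without extra information.
print(T_total, T_clusters.sum())
```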

Proposition 2.1. The proposed model under the specification in expression (2.3) is identifiable if β 0 or one of the coordinates of vector γ is fixed at a known value.
Proof. Firstly, fix β_0 at a known value. In this case, model identifiability follows by noticing that the vector of sufficient statistics associated with the parameter vector Ψ* = (β_1, ..., β_p, γ_1, ..., γ_K), namely T*(y) = (Σ_{i=1}^{A} X_1i y_i, ..., Σ_{i=1}^{A} X_pi y_i, Σ_{i∈A_1} y_i, ..., Σ_{i∈A_K} y_i), has p + K linearly independent coordinates, matching the number of unknown parameters. Similarly, without loss of generality, let γ_1 be known. Under this assumption, the sufficient statistics related to the parameter vector Ψ** = (β_0, β_1, ..., β_p, γ_2, ..., γ_K) are given in T**(y) = (Σ_{i=1}^{A} y_i, Σ_{i=1}^{A} X_1i y_i, ..., Σ_{i=1}^{A} X_pi y_i, Σ_{i∈A_2} y_i, ..., Σ_{i∈A_K} y_i). In this case, the proof follows straightforwardly by noticing that the first coordinate of T**(y) cannot be recovered as a linear combination of the last p + K − 1 coordinates, as it depends on Σ_{i∈A_1} y_i.
Our proposal addresses situations in which trustful prior information is available only about γ_1. This parameter is easily interpretable as the underreporting probability in those areas having the best data quality. Thus, only prior information about the proportion of unrecorded data in such areas is required to identify the proposed model. Despite its appealing interpretation, the precise choice of the value of γ_1 may not be a simple task in practical situations. However, it is possible to obtain from experts some pieces of information about the most likely values for such a parameter. This information may be suitably expressed by means of a non-degenerate informative prior distribution for γ_1, thus relaxing the requirement of exactly knowing its value (for further discussion on the use of prior information to attain model identification, see Gustafson et al. (2005)).
Another way to investigate model identifiability is to consider the associated Fisher information. The Fisher information plays an important role in the asymptotic theory of maximum likelihood estimation as well as in Bayesian reference analysis. Besides that, Rothenberg (1971) showed that a model belonging to the exponential family is globally identifiable if and only if its Fisher information matrix is nonsingular. For the proposed model, the entries of the Fisher information matrix I(Ψ) are I_{β_l β_m} = Σ_{i=1}^{A} μ_i X_li X_mi, I_{β_l γ_j} = −Σ_{i=1}^{A} μ_i X_li h_ji / ε_i and I_{γ_j γ_k} = Σ_{i=1}^{A} μ_i h_ji h_ki / ε_i², where μ_i = E_i θ_i ε_i and X_0i = 1 for all i. The bottom right K × K sub-matrix of I(Ψ), corresponding to γ, will be considered in Section 2.2 to build the Jeffreys prior for γ given β = (β_0, β_1, ..., β_p).

Proposition 2.2. The Fisher information matrix I(Ψ) associated with the model given in expressions (2.1) and (2.2) is singular.
Proof. Denote by C_κ the column vector of I(Ψ) associated with the parameter κ ∈ Ψ, so that I(Ψ) = [C_{β_0} C_{β_1} ··· C_{β_p} C_{γ_1} ··· C_{γ_K}]. It can be verified that C_{β_0} = −(1 − γ_1) C_{γ_1} + Σ_{l=2}^{K} γ_l C_{γ_l}, so the columns of I(Ψ) are linearly dependent. Thus, I(Ψ) is a singular matrix and, from Theorem 3 in Rothenberg (1971), it follows that the associated statistical model is not globally identifiable.
As previously shown in Proposition 2.1, model identifiability is achieved provided that the parameter γ 1 is fixed at a known value. In the general case, it is difficult to prove directly that the Fisher information matrix I(Ψ) is nonsingular when we fix γ 1 . However, some special cases are amenable to analytic treatment and they are illuminating for this identifiability discussion as shown in Proposition 2.3.

Proposition 2.3. If log(θ_i) = β_0 for i = 1, ..., A (i.e., no covariates in the log-linear predictor) and γ_1 is fixed at a known value γ_1^0 ∈ [0, 1), then the Fisher information matrix associated with the model given in expressions (2.1) and (2.2) is nonsingular.
Proof. The Fisher information matrix I(Ψ*) under this model specification is obtained from I(Ψ) by removing the columns and rows related to the parameters β_1, ..., β_p and γ_1 and setting γ_1 = γ_1^0. After some calculation, we obtain an explicit expression for the determinant of I(Ψ*) in which all sum terms are positive. Consequently, we have det I(Ψ*) > 0, implying that I(Ψ*) is a nonsingular matrix. From Theorem 3 in Rothenberg (1971), it follows that the associated statistical model is globally identifiable.
The previous propositions provide some mathematical constraints for model identifiability, which are necessary to guarantee that all parameters can be estimated from the observed data. Such constraints do not guarantee, however, that all parameters will be well estimated; that is, theoretical identifiability does not imply practical identifiability. Even for an identifiable model, large sample sizes might be required to obtain good parameter estimates in some situations. On the other hand, for a non-identifiable model, some parameters might not be estimated even with large datasets if the identifiability constraints are not considered.
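Propositions 2.1 and 2.2 can be illustrated numerically. The sketch below assumes the standard Poisson information form I(Ψ) = Σ_i μ_i J_i J_i^T, where J_i is the gradient of log μ_i; this form, and all numerical values, are illustrative assumptions rather than the paper's exact expressions:

```python
import numpy as np

rng = np.random.default_rng(1)
A, p, K = 30, 1, 3
cluster = np.repeat([1, 2, 3], 10)                     # 10 areas per cluster
H = (np.arange(1, K + 1)[None, :] <= cluster[:, None]).astype(float)
X = np.column_stack([np.ones(A), rng.normal(size=A)])  # intercept + 1 covariate
E = rng.uniform(50, 150, size=A)

beta = np.array([0.1, 0.3])
gamma = np.array([0.05, 0.10, 0.15])
eps = 1.0 - H @ gamma
mu = E * np.exp(X @ beta) * eps

# Poisson information: I(Psi) = sum_i mu_i * J_i J_i^T, where J_i holds
# d log(mu_i)/d beta = x_i and d log(mu_i)/d gamma_j = -h_ji / eps_i.
J = np.column_stack([X, -H / eps[:, None]])
I_full = (J * mu[:, None]).T @ J

# Null vector exhibiting the column dependence of Proposition 2.2:
# C_beta0 = -(1 - gamma_1) C_gamma1 + gamma_2 C_gamma2 + gamma_3 C_gamma3.
v = np.array([1.0, 0.0, 1.0 - gamma[0], -gamma[1], -gamma[2]])
print(np.abs(I_full @ v).max())        # numerically zero -> I(Psi) singular

# Fixing gamma_1 (dropping its row/column) restores invertibility (Prop. 2.1).
idx = [0, 1, 3, 4]                     # keep beta_0, beta_1, gamma_2, gamma_3
I_fixed = I_full[np.ix_(idx, idx)]
print(np.linalg.cond(I_fixed) < 1e8)   # well-conditioned -> nonsingular
```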

An alternative parametrization of the proposed model is obtained by writing the reporting probabilities as ε_i = exp(−Σ_{j=1}^{K} δ_j h_ji), with δ_j > 0 for all j = 1, ..., K (2.4). Under this parametrization, the likelihood function is given by a product of Poisson likelihoods with means E_i exp(β_0 + β_1 X_1i + ··· + β_p X_pi − Σ_{j=1}^{K} δ_j h_ji), i = 1, ..., A.
Concerning the model identification, the parametrization in (2.4) is quite attractive as it leads to a regular Poisson generalized linear model (GLM). By framing the model as a GLM, the conditions for model identification are easily found, especially the requirement that θ and ε are associated with disjoint sets of covariates. Also, as the first component of h_i is equal to 1 for all i, such a parametrization makes it clear that δ_1 works like a second intercept for which an informative prior must be elicited. However, such a parametrization brings some additional challenges to model the uncertainty about ε. While γ_j has a clear and meaningful interpretation for practitioners, δ_j is interpreted as the ratio between the reporting probability in cluster j and cluster j − 1 on the log scale for j > 1. As for δ_1, it is the log of the proportion of recorded data in the best cluster in relation to a scenario with perfectly recorded data. We also have a challenge regarding the appropriate prior specification for δ. To ensure a valid Poisson model we must have δ_j > 0 for all j. As, a priori, we only have trustful information about ε_(1) and we know that 0 < ε_(K) ≤ ε_(K−1) ≤ ··· ≤ ε_(2) ≤ ε_(1) ≤ 1, where ε_(j) denotes the reporting probability in the jth data quality cluster, we cannot simply assume independent positive distributions for the δs. Notice that δ_1 = log(1) − log(ε_(1)) and δ_l = log(ε_(l−1)) − log(ε_(l)), for l = 2, ..., K. Then, we must transform the prior information on ε_(1) to the log scale and use it to build a distribution with positive support for δ_1. Then, the prior distribution of δ_2 should be such that the distribution of δ_2 + δ_1 = −log(ε_(2)) is a truncated distribution putting all probability mass on values higher than δ_1. Similar constraints should be imposed on the prior distributions of the remaining δs.
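The mapping between the γ and δ parametrizations can be checked directly (cluster-level values hypothetical):

```python
import numpy as np

gamma = np.array([0.05, 0.10, 0.15])       # hypothetical increments
eps_cluster = 1.0 - np.cumsum(gamma)       # reporting probability per cluster

# delta_1 = -log(eps_(1)); delta_l = log(eps_(l-1)) - log(eps_(l)), l >= 2
delta = -np.diff(np.log(np.concatenate([[1.0], eps_cluster])))

# All deltas are positive, and exp(-cumsum(delta)) recovers eps_(j),
# i.e. eps_i = exp(-h_i^T delta) under the split coding.
print(np.round(delta, 4))
print(np.round(np.exp(-np.cumsum(delta)), 4))   # matches eps_cluster
```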

Modeling the prior uncertainty about γ
As a starting point, we could consider independent informative Beta distributions by eliciting γ_j ind~ Beta(α_j, ν_j), j = 1, ..., K, where the hyperparameters α_j > 0 and ν_j > 0 should be elicited by experts. This strategy was considered by Schmertmann and Gonzaga (2018) in their particular application to estimate age-mortality rates in Brazil. This is a cumbersome approach as it might lead to some difficulties in the computational implementation of our model. First of all, the constraint Σ_{j=1}^{K} γ_j < 1 should be satisfied, since some events are recorded even in areas belonging to the worst data quality cluster and, to have a valid Poisson model, ε_i must be non-null for all i. Furthermore, some dependence among the γ_j's is desirable. To deal with the first problem, we may consider a Dirichlet distribution on the augmented vector (γ_1, ..., γ_K, γ_{K+1}), where γ_{K+1} = 1 − Σ_{j=1}^{K} γ_j is the percentage of data recorded in the worst cluster. More interestingly, both issues may be jointly addressed as described below. We propose considering a joint prior for γ = (γ_1, ..., γ_K) based on the generalized Beta distribution as follows:

γ_1 ~ GBeta(α_1, ν_1; a_1, a*_1) and γ_j | γ_1, ..., γ_{j−1} ~ GBeta(α_j, ν_j; a_j (1 − Σ_{l=1}^{j−1} γ_l), a*_j (1 − Σ_{l=1}^{j−1} γ_l)), j = 2, ..., K, (2.5)

where GBeta(α, ν; a, b) denotes the generalized Beta distribution with probability density function proportional to (x − a)^{α−1} (b − x)^{ν−1}, for a < x < b. The generalized Beta distribution can be obtained as the linear transformation X = a + (b − a)B, where B ~ Beta(α, ν). By letting a_j = 0 and a*_j = 1 for all j = 1, ..., K, the prior distribution in expression (2.5) is the well-known stick-breaking representation of the Dirichlet process, in which we consider independent random variables Z_j ~ Beta(α_j, ν_j), j = 1, ..., K, and we let γ_1 = Z_1 and γ_j = Z_j Π_{l=1}^{j−1} (1 − Z_l) for j = 2, ..., K. This is an advantageous feature we exploit to facilitate the computational implementation of the generalized Beta prior distribution.
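A sampler for the stick-breaking construction described above can be sketched as follows (hyperparameter values are hypothetical); by construction, every draw satisfies Σ_j γ_j < 1:

```python
import numpy as np

def sample_gamma(alpha, nu, a, a_star, rng):
    """Draw gamma = (gamma_1, ..., gamma_K) via stick-breaking: at step j,
    a generalized Beta draw on (a_j, a_star_j) takes a fraction of the
    remaining stick, so the gammas always sum to less than one."""
    K = len(alpha)
    gamma = np.empty(K)
    stick = 1.0
    for j in range(K):
        b = rng.beta(alpha[j], nu[j])         # Beta(alpha_j, nu_j) draw
        w = a[j] + (a_star[j] - a[j]) * b     # GBeta draw on (a_j, a_star_j)
        gamma[j] = w * stick
        stick -= gamma[j]
    return gamma

rng = np.random.default_rng(7)
# Hypothetical hyperparameters; alpha_j = nu_j = 1 gives the uniform case (2.6).
draws = np.array([sample_gamma([2, 1, 1], [8, 1, 1], [0, 0, 0], [1, 1, 1], rng)
                  for _ in range(5000)])

# The construction enforces sum_j gamma_j < 1 in every draw.
print(draws.sum(axis=1).max() < 1.0)
```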
If we set α_j = ν_j = 1 for j = 1, ..., K, and 0 ≤ a_j < a*_j ≤ 1, j = 1, ..., K, the conditional prior distributions given in expression (2.5) correspond to a simpler model based on conditional uniform distributions, so that

γ_1 ~ Uniform(a_1, a*_1) and γ_j | γ_1, ..., γ_{j−1} ~ Uniform(a_j (1 − Σ_{l=1}^{j−1} γ_l), a*_j (1 − Σ_{l=1}^{j−1} γ_l)), j = 2, ..., K. (2.6)

The uniform prior distribution in expression (2.6) is more parsimonious and easier to elicit. In turn, the generalized Beta prior distribution in expression (2.5) is more flexible and provides different shapes for the marginal prior distribution of each γ_j. Thus, the choice between the prior distributions given by expressions (2.5) and (2.6) will depend on the information that the practitioner has available. In practice, the choice of all prior hyperparameters might be driven by experts' opinion or guided by the results of previous studies. Special attention, however, should be given to the prior distribution of γ_1, as it plays an important role in the proposed model identification. As discussed in Section 2.1, it has to be informative, putting a significant probability mass on the subset of the parametric space indicated by the experts as containing the most likely values for such a parameter.
Regardless of the prior chosen for γ, the generalized Beta or the particular case of the conditional uniform, by assuming the structure in expression (2.2), the increment in the underreporting probability associated with each cluster j amounts to just a fraction of what is left after considering the probabilities of the previous (better) groups. Thus, the prior distribution of ε_i for areas outside the best cluster inherits the prior information on the reporting probability in the best areas.
The unconditional prior expectation and variance of ε_i are useful whenever an informative prior distribution for γ_1, or any other component of the parameter vector γ, is to be elicited. Assuming the distribution in (2.5), for all i ∈ A_j, i.e., all areas classified in the jth data quality cluster, j = 1, …, K, the prior unconditional expectation and variance of ε_i are, respectively, E(ε_i) = ∏_{l=1}^{j} (1 − c_l) and V(ε_i) = ∏_{l=1}^{j} [(1 − c_l)² + d_l] − ∏_{l=1}^{j} (1 − c_l)², where c_l = a_l + (a*_l − a_l) α_l (α_l + ν_l)^{-1} and d_l = (a*_l − a_l)² α_l ν_l (α_l + ν_l)^{-2} (α_l + ν_l + 1)^{-1}.
For the particular case in which a_j = 0 and a*_j = 1 for all j, these expressions simplify since c_l = α_l (α_l + ν_l)^{-1} and d_l = α_l ν_l (α_l + ν_l)^{-2} (α_l + ν_l + 1)^{-1}, so that E(ε_i) = ∏_{l=1}^{j} ν_l (α_l + ν_l)^{-1}. Similar results under the conditional uniform prior distribution in expression (2.6) are provided in the Supplementary Material (Oliveira et al., 2020).
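Under the stick-breaking case (a_j = 0, a*_j = 1 for all j), ε_i for an area in cluster j reduces to ∏_{l≤j}(1 − Z_l), so its prior mean is ∏_{l≤j} ν_l (α_l + ν_l)^{-1}. A quick Monte Carlo sanity check of this identity, with arbitrarily chosen illustrative hyperparameters (not values from the paper), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)

# arbitrary illustrative hyperparameters for K = 4 clusters
alpha = np.array([2.0, 3.0, 1.5, 2.5])
nu = np.array([40.0, 20.0, 10.0, 8.0])

# draws of eps_i = prod_{l<=j}(1 - Z_l) for every cluster j at once
Z = rng.beta(alpha, nu, size=(200_000, 4))
eps = np.cumprod(1.0 - Z, axis=1)

mc_mean = eps.mean(axis=0)                    # Monte Carlo estimate per cluster
closed_form = np.cumprod(nu / (alpha + nu))   # product of E(1 - Z_l)
```

Since the Z_l are independent, the expectation of the product factorizes, and the two vectors agree up to Monte Carlo error.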
Another way to model the prior uncertainty about the model parameters is to consider Jeffreys' approach (Jeffreys, 1946).
We assume that, a priori, γ is independent of β = (β_0, β_1, …, β_p), and we focus only on the Jeffreys prior for γ. The Fisher information matrix for the vector γ, given β, is the bottom-right K × K submatrix highlighted in bold in I(Ψ), which is given in Section 2.1. Consequently, the Jeffreys prior distribution for γ is the one given in expression (2.7). The challenge is to prove that the prior in expression (2.7) is a proper distribution and to investigate the level of prior information about γ_1 induced by the Jeffreys prior.
Proposition 2.4. The Jeffreys prior distribution for γ given in expression (2.7) is proper.
Assuming the Jeffreys prior in expression (2.7), the prior expected value of γ_1 is 0.6667, and its marginal prior distribution concentrates most of the probability mass around large values (see Figure 1). Such a prior is not expected to provide good posterior estimates for the model parameters whenever the true percentage of underreported events in areas with the best data quality is small and far from that prior expected value. In particular, it is not an appropriate prior to model the uncertainty about γ_1 in the case study addressed in this paper, where the probability of underreporting in the best areas is expected to be close to zero. To illustrate the effect of the marginal Jeffreys prior distribution of γ_1 on the joint Jeffreys prior for γ, Figure 1 also displays the joint Jeffreys prior distribution of the parameters γ_1 and γ_2. As the prior associated with γ_1 is centered around large values, the most probable prior values for the vector (γ_1, γ_2) combine large values of γ_1 with small values of γ_2.

Modeling the prior uncertainty about θ
To model the uncertainty about the relative risk θ, assume that p covariates are available such that log(θ_i) = β_0 + β_1 X_1i + ⋯ + β_p X_pi, i = 1, …, A. The intercept β_0 represents a common term affecting the risk in all areas and is assigned a N(0, σ²_{β_0}) prior. To model the prior uncertainty about the fixed effects β = (β_1, …, β_p), we assume that β | Σ_β ~ N_p(0, Σ_β), where N_p denotes the p-variate Gaussian distribution and Σ_β = diag{σ²_1, …, σ²_p}. It is also appealing to consider some technique to perform Bayesian variable selection, the goal being to identify covariates with a likely non-zero effect on the relative risks. The stochastic search variable selection (SSVS) method, proposed by George and McCulloch (1993), assigns a spike-and-slab mixture of Gaussian distributions to the fixed effects β. The spike component concentrates closely around zero, representing an effectively null effect of the covariate; the slab component has a sufficiently large variance to allow the effect to spread over larger values. Thus, to complete the SSVS prior specification, we additionally assume the mixture prior given in expression (2.8), where δ_x(·) denotes the Kronecker delta concentrated at the point x, and the hyperparameters σ²_slab, σ²_spike and ρ_m, for m = 1, …, p, should be specified. To allow for local differences among the risks, beyond the pattern explained by the covariates, a more complete model with regional effects u = (u_1, …, u_A) can be considered in the log-linear regression by assuming u_i ~iid N(0, σ²_u), i = 1, …, A. Spatial effects s = (s_1, …, s_A), quantifying the influence of neighboring areas, can also be added to the regression structure such that log(θ_i) = β_0 + X_i β + s_i + u_i. The usual joint prior distribution for s is the intrinsic conditional autoregressive (ICAR) distribution with variance parameter σ²_s (see Besag, York, and Mollié (1991) for details on the ICAR prior).
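A draw from a spike-and-slab prior of this kind can be sketched as follows. This is a hypothetical illustration, not the paper's code: the function name `draw_ssvs_prior` is ours, and the default hyperparameter values mirror those used later in Section 3.

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_ssvs_prior(p, rho=0.5, var_spike=0.001, var_slab=10.0, rng=rng):
    """Draw fixed effects beta from an SSVS spike-and-slab mixture:
    an inclusion indicator omega_m ~ Bernoulli(rho) selects the slab
    (large-variance) or the spike (near-zero) Gaussian component."""
    omega = rng.random(p) < rho
    sd = np.where(omega, np.sqrt(var_slab), np.sqrt(var_spike))
    beta = rng.normal(0.0, sd)
    return omega.astype(int), beta

omega, beta = draw_ssvs_prior(p=5)
# effects with omega_m = 0 come from the spike and are negligible
# in magnitude; those with omega_m = 1 come from the slab
```

In an MCMC implementation the indicators ω_m are updated alongside β, and their posterior means estimate the inclusion probabilities used for variable selection.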
We further assume that the model variance parameters follow σ²_s ~ IG(a_s, d_s) and σ²_u ~ IG(a_u, d_u), where IG denotes the Inverse-Gamma distribution. The parameters β_0, β, u and s are assumed to be a priori independent.
Assuming the prior distributions discussed in this section, the joint posterior distribution for all model parameters is not known in closed form. Posterior inference can be carried out through a Markov chain Monte Carlo (MCMC) scheme. The posterior full conditional distributions that can be considered for sampling from the joint posterior distribution are given in the Supplementary Material (Oliveira et al., 2020).

Simulated data studies
In this section, we investigate the performance of the proposed model through Monte Carlo simulations. To mimic our case study presented in Section 4, we consider the map of Minas Gerais State, which is composed of A = 75 areas. A total of 100 datasets are generated from Poisson distributions such that Y_i ~ Poisson(E_i θ_i ε_i) independently, for i = 1, …, 75, where ε_i = 1 − h_i^T γ and the expected number of cases E_i is known and equal to the one available for the case study. We also consider the same clustering indicator variable used in the case study, which has K = 4 data quality categories (clusters), partitioning the map into groups with 28, 16, 14 and 17 areas, respectively, from the best to the worst category. This clustering variable is based on the adequacy index (AI) introduced by França et al. (2006); a detailed explanation of the clustering construction is given in Section 4. We set γ = (0.05, 0.10, 0.15, 0.20), imposing that 5% of the events are unreported in the areas classified at the highest level of data quality whereas only 50% of the events are reported in the areas belonging to the worst data quality cluster. To generate the relative risks, we consider independent observations from five covariates such that log(θ_i) = β_0 + β_1 X_1i + ⋯ + β_5 X_5i, where β_0 = 0.50 and β = (−0.25, −0.25, 0, 0, 0.25). These covariates are different from the clustering covariate; they were selected from our real dataset such that some of them are correlated with the clustering covariate. All covariates considered here are provided in the Supplementary Material (Oliveira et al., 2020).
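The data-generating step just described can be sketched in a few lines. This is a simplified illustration: the expected counts E_i and relative risks θ_i below are placeholders, whereas the study uses the case-study offsets and a covariate-based log-linear model for θ.

```python
import numpy as np

rng = np.random.default_rng(123)

A, K = 75, 4
# cluster labels for the 75 areas: 28, 16, 14 and 17 areas,
# from the best (0) to the worst (3) data quality category
cluster = np.repeat(np.arange(K), [28, 16, 14, 17])
gamma = np.array([0.05, 0.10, 0.15, 0.20])

# h_i^T gamma accumulates the under-reporting increments of all
# clusters up to and including the one area i belongs to
eps = 1.0 - np.cumsum(gamma)[cluster]          # reporting probabilities
E = rng.uniform(5.0, 50.0, size=A)             # placeholder expected counts
theta = np.exp(0.5 + rng.normal(0.0, 0.1, A))  # placeholder relative risks

Y = rng.poisson(E * theta * eps)               # under-reported counts
```

With γ = (0.05, 0.10, 0.15, 0.20), the reporting probability decreases from 0.95 in the best cluster to 0.50 in the worst one, matching the setting described above.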
When fitting the simulated datasets, three different structures are considered for the relative risk θ. In Model 1, we let log(θ_i) = β_0 + β_1 X_1i + ⋯ + β_5 X_5i, where β_m ~iid N(0, 10) for m = 0, …, 5. Model 2 differs from Model 1 by considering a variable selection scheme on the set of covariates through the SSVS prior distribution given in expression (2.8), with σ²_spike = 0.001, σ²_slab = 10 and ρ_m = 0.5 for m = 1, …, 5. Model 3 differs from Model 2 by the inclusion of both local and spatially structured random effects in the log-linear regression, such that log(θ_i) = β_0 + β_1 X_1i + ⋯ + β_5 X_5i + u_i + s_i, where u_i is the local effect of area i and s = (s_1, …, s_A) denotes the spatial effects having the ICAR prior distribution (Besag, York, and Mollié, 1991) with precision parameter τ_s = σ_s^{-2}. The neighborhood structure inherent to the map of the case study in Section 4 is adopted to model the spatial effects s, and we further assume that the model precision parameters are modeled as 1/σ²_s ~ Gamma(0.5, 0.0005) and 1/σ²_u ~ Gamma(2, 0.01). The prior specification for γ differs across the simulation studies and will be properly described in each case. Basically, the joint prior distributions given in expressions (2.5) and (2.6) are elicited with different levels of information, especially regarding the prior distribution of the parameter γ_1, which is associated with the model identifiability.
Posterior estimates (posterior means) for the relative risks, θ, are compared in terms of bias (Bias), relative mean squared error (RMSE) and the nominal coverage of the 95% highest posterior density intervals (Cov.), averaged over the R = 100 Monte Carlo replications (Oliveira et al., 2020). For each generated dataset, the MCMC scheme considered a total of 100,000 iterations, with the first 50,000 discarded as a burn-in period and a thinning lag of 25 iterations adopted to reduce autocorrelation in the posterior samples.

Simulation Study I: comparing the generalized Beta and the conditional uniform priors for γ
In this study, we mainly evaluate the sensitivity of the posterior estimates of θ when different degrees of information are assumed in the prior distributions for γ defined in expressions (2.5) and (2.6). In both cases, two levels of prior information, named partially informative and fully informative, are considered. The partially informative case assumes an informative prior only for the parameter γ_1; here, this is attained by choosing hyperparameters such that the prior π(γ_1) is centered and highly concentrated around the true value of γ_1. We elicited γ_1 ~ GB(2.9, 55.1; 0, 1) under the generalized Beta prior and γ_1 ~ U(0, 0.10) under the conditional uniform prior. For all remaining γ_j, j = 2, …, 4, the associated prior distribution assumes a_j = 0, a*_j = 1 and, additionally in the generalized Beta case, α_j = ν_j = 1. By doing so, we impose a strong constraint on the reporting probability associated with the areas belonging to the best data quality cluster but, for all remaining areas, the only prior information is the one inherited from the prior of γ_1. Finally, in the case of fully informative prior distributions, all hyperparameters a_j and a*_j and, additionally, α_j and ν_j in the generalized Beta case, are chosen such that π(γ_j) is centered and highly concentrated around the true value of γ_j, for j = 1, …, 4. For comparison purposes, we also consider the standard Poisson model, which does not take underreporting into account.

Table 1 summarizes the results. By eliciting an informative prior distribution only for the parameter γ_1 (partially informative case), the proposed model provides good posterior estimates for the risks, with bias and RMSE close to zero. The results are quite close to those obtained under informative priors for all components of the parameter vector γ (fully informative case). In general, we observe only a slight difference between the results obtained under the generalized Beta and the conditional uniform prior distributions for γ, the former having a greater number of hyperparameters to be chosen. Results under Models 1-3 are quite similar, showing that spatial and local effects do not significantly influence the posterior inferences. This is an interesting result, as the data are generated from a model that does not include any spatial or local correlation. It should also be mentioned that the non-significant (null) effect of covariates X_3 and X_4 (results not shown) is well identified even under Model 1, which does not consider variable selection.

Table 1 also shows that, as expected, the standard Poisson model fails to estimate the relative risks, θ, whenever applied to underreported data. It produces very poor estimates, always underestimating the relative risks no matter which structure is imposed to model them. The RMSE under such a model is reasonably small, but the 95% credible intervals do not contain the true value of the relative risk for the great majority of the Monte Carlo replications, meaning that the posterior distribution of θ tends to put negligible probability mass around its true value.

Simulation Study II: effect of the prior uncertainty about γ_1
The prior distribution of the parameter γ_1 plays an important role in model identification and, consequently, in the quality of the posterior estimates. In this section, we reexamine the datasets considered in Section 3.1, fitting the proposed model with different partially informative prior distributions for γ, that is, with an informative prior distribution considered only for the component γ_1. A sensitivity analysis is performed in order to evaluate the influence of this prior distribution on the posterior inference.
The evaluation metrics for the posterior estimates of θ under six different conditional uniform priors for γ_1 (Table 2) show that the relative risks tend to be underestimated if, a priori, we elicit γ_1 ~ U(0.0, 0.01) or γ_1 ~ U(0.0, 0.05); such prior distributions put all probability mass below 0.05, which is the true value of γ_1. On the other hand, the risks tend to be overestimated whenever the prior expectation exceeds the true value of γ_1. The larger the difference between the prior expectation E(γ_1) and the true value of γ_1, the larger the bias and RMSE of the posterior estimates of θ. This is not a surprising result, and it highlights the importance of searching for reliable prior information about the parameter γ_1 in practical situations.
Table 2 also shows that, if we assume γ_1 ~ U(0, 0.05) or γ_1 ~ U(0, 0.15), the prior means differ from the true value of γ_1 by the same amount. Although the latter prior imposes a much higher prior variance than the former, the posterior estimates present similar absolute values for the bias and the RMSE in both cases. This suggests that the quality of the posterior estimates under the proposed model is more strongly related to the prior expectation of γ_1 than to its prior variance. This idea is supported by the results in Table 3, which exhibits some evaluation metrics for the posterior inference on θ assuming different partially informative generalized Beta prior distributions for γ. In all cases, γ_1 ~ GB(α_1, ν_1; 0, 1), where the hyperparameters α_1 and ν_1 are chosen such that this prior is centered around the true value of γ_1, that is, E(γ_1) = 0.05, but the prior variance of γ_1 varies from 0.00002 to 0.00950.

Table 2: Bias and relative mean squared error (RMSE) for the estimated relative risks θ assuming partially informative conditional uniform priors for γ with six levels of prior information on γ_1 (E(γ_1) and V(γ_1) are different in all cases); Simulation Study II.
When generalized Beta prior distributions with V(γ_1) = 0.00024 and V(γ_1) = 0.00226 are assumed, the biases are much smaller than those observed in Table 2 under the priors γ_1 ~ U(0.0, 0.05) and γ_1 ~ U(0.0, 0.15), whose variances are similar (respectively, V(γ_1) = 0.00021 and V(γ_1) = 0.00188). Moreover, the bias and RMSE under the prior U(0.0, 0.30), which has variance equal to 0.0075, are much higher than those obtained when assuming a generalized Beta prior with a variance equal to 0.0095. In summary, these findings provide further evidence that the posterior inference is more influenced by the prior expectation of γ_1 than by its prior variance.

Table 3: Bias and relative mean squared error (RMSE) for the estimated relative risks θ assuming partially informative generalized Beta priors for γ with six distinct levels of information on γ_1 (E(γ_1) = 0.05, the true value, and a different prior variance in each case); Simulation Study II.

Table 4 exhibits the averaged posterior means for the parameters β_0, β, γ and ω under three of the partially informative conditional uniform prior distributions for γ_1 considered in the previous studies. Results for Models 1-3 are quite similar, thus we only present those obtained under Model 3. The vector of fixed effects β and the variable selection parameter ω are well estimated regardless of the prior distribution elicited for γ_1, but very little is learned about γ_1 from the data. The posterior mean of γ_1 tends to be close to its prior expectation, reinforcing the importance of obtaining reliable prior information about this parameter. Posterior estimates for the remaining components of γ become worse as the prior expectation of γ_1 moves away from its true value and the prior variance of γ_1 increases.

Table 4: Averaged posterior means of β_0, β, γ and ω under three different prior specifications for the parameter γ_1; Simulation Study II.

The quality of the posterior estimates for β_0 and γ_1 is closely related, which is not surprising given the identifiability issues discussed in Section 2.1. The intercept β_0 is overestimated (resp., underestimated) whenever γ_1 is also overestimated (resp., underestimated). Since β_0 directly affects the estimation of the relative risks θ, overestimating (resp., underestimating) β_0 leads to overestimated (resp., underestimated) relative risks, inducing the larger positive (resp., negative) biases shown in Table 2.

Simulation Study III: breaking the identification constraints
Our goal here is to show the effect of using the same source of information to model both the relative risk θ and the reporting probability ε. We consider two different scenarios. In the first, the same covariate is present in both sets X and H; consequently, as the constraints for model identification are not fulfilled, problems in estimating the model parameters are expected. In the second scenario, we use the same variable coded in two different ways: in X it is continuous, while in H it is considered on a discretized scale obtained by breaking its continuous range into four intervals and coding them with dummy variables. In this case, despite the very strong correlation between X and H, we should obtain good posterior estimates for all model parameters.
We consider the same four clusters used in the previous simulation studies, which are based on a variable called the adequacy index (AI) available in our case study (Section 4). In the first scenario, named Categorical AI, the variable AI is considered in its discretized version with four categories indicating the clusters, and it enters in this discretized form in both X and H. In the second scenario, named Continuous AI, the discretized version is maintained in H but, for X, we consider the original continuous AI re-scaled to have zero mean and unit standard deviation. To generate the datasets, we set γ = (0.05, 0.10, 0.15, 0.20) and assume the covariates X_1i, …, X_4i as in the previous studies. In the Continuous AI scenario, we let log(θ_i) = β_0 + β_1 X_1i + ⋯ + β_5 X_5i, where X_5i is the AI on its continuous scale, β_0 = 0.15 and β = (−0.25, −0.25, 0, 0, −0.25). In the Categorical AI scenario, we assume log(θ_i) = β_0 + β_1 X_1i + ⋯ + β_4 X_4i + β_{5,1} D_1i + ⋯ + β_{5,3} D_3i, where β_0 = 0.15 and β = (−0.25, −0.25, 0, 0, 0.25, 0.50, 0.75). The dummy variable D_li represents the lth level of the discretized AI, for l = 1, 2, 3. To analyze the data, we consider the partially informative conditional uniform prior for γ in which γ_1 ~ U(0, 0.10).
As expected, Table 5 shows that the posterior inferences for the relative risks are much worse if we break the identifiability constraints (Categorical AI case). However, the estimates do not lose quality if we consider strongly correlated variables to model θ and ε (Continuous AI case). In the Categorical AI case, Table 6 shows confounding between the parameters γ and the effects of the dummy variables, with all of these parameters poorly estimated. This problem is not experienced in the Continuous AI case. These findings are in perfect agreement with the theoretical identifiability results discussed in Section 2.1.

Comments on further simulation studies
Section 3 of the Supplementary Material (Oliveira et al., 2020) presents additional simulation studies exploring other features of the proposed model; their main results are summarized here. A discussion of the misspecification of the number of data quality clusters, K, is provided in Section 3.1 of the Supplementary Material. In summary, for the simulated datasets, we note that the misspecification of K introduces more bias as well as higher variability in the posterior estimates of θ. Both bias and RMSE are much higher when the number of clusters assumed in fitting the proposed model is smaller than the true value of K than when it is greater than the true one.
We also evaluate whether the number of areas within the best and worst data quality clusters significantly affects the posterior inference for the relative risks θ (Section 3.2 of the Supplementary Material). In summary, we observed that a greater number of areas within the best data quality cluster decreases the bias in the posterior estimates of θ. This is expected since, whenever the number of areas within the best group is larger, the model induces an informative prior for a greater number of areas.
Finally, from the study presented in Section 3.3 of the Supplementary Material, we note that, if the data are correctly recorded (that is, if ε_i = 1 for all i), the relative risks θ are overestimated under our approach, with the bias magnitude depending on the prior knowledge about γ (see Web Table 3). In this context, as expected, the standard Poisson model performs very well, presenting both bias and RMSE close to zero. However, the standard Poisson model always underestimates the relative risks if counts are partially recorded (see Table 1 of the main paper), and the bias magnitude depends on the amount of underreporting in the data. It is worth mentioning that, in terms of bias and RMSE, the proposed model fitted to perfectly recorded data shows better results than the standard Poisson model fitted to underreported data. In practical situations, the relative risk estimates may guide the definition of government policies for control and intervention; the underestimation of such quantities thus leads to undesirable consequences, for instance, when mapping disease or mortality risks.

Early neonatal mortality data in Brazil
Our goal here is to map the relative risk of early neonatal mortality (ENM) in Minas Gerais State (MG), Brazil, and to identify factors possibly associated with the event occurrence. ENM refers to deaths occurring within the first seven days of life. Infant mortality information produced in MG is usually underreported (Campos, Loschi, and França, 2007), mainly in the socio-economically more deprived areas located in the northern and northeastern regions of the state. In order to define efficient policies to reduce the number of early neonatal deaths and properly distribute financial resources, it is important to correctly estimate the associated risks.
The counts were obtained from the Sistema de Informações sobre Mortalidade (SIM) and the Sistema de Informações sobre Nascidos Vivos (SINASC) of the National Health System of the Brazilian Ministry of Health (BMH). The 853 municipalities of MG were grouped into A = 75 microregions (areas) following the official division suggested by the BMH. Two periods comprising the two most recent Brazilian Demographic Censuses are considered, namely 1999-2001 and 2009-2011. To analyze the datasets, we fit the proposed model assuming that Y_i and E_i are, respectively, the observed and the expected counts of ENM in area i = 1, …, 75. We consider the usual naive estimator for the offset, E_i = n_i (∑_{l=1}^{75} y_l / ∑_{l=1}^{75} n_l), where n_i represents the total number of newborn children at risk in the ith area and y_i is the observed count of early neonatal deaths in that area. For comparison purposes, we also fit the standard Poisson model, which ignores underreporting in its structure by assuming ε_i = 1 for all areas.
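This naive offset (internal standardization) is simple to compute; a minimal sketch with synthetic placeholder data, since the actual SIM/SINASC counts are in the Supplementary Material:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical per-area newborns at risk (n) and observed deaths (y)
n = rng.integers(500, 20_000, size=75)
y = rng.binomial(n, 0.01)

# naive offset: each area's newborns at risk times the overall
# observed early neonatal death rate
E = n * (y.sum() / n.sum())
```

By construction the expected counts sum to the observed total, so the offsets redistribute the overall death rate across areas in proportion to their populations at risk.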
The ENM relative risk assumes a log-linear regression structure that includes local and spatial random effects, that is, log(θ_i) = β_0 + X_i β + u_i + s_i, i = 1, …, 75. Five covariates are included in this regression model: the Municipal Human Development Index (MHDI), the proportion of mothers with more than twelve years of formal education (MomEduc), the proportion of children with birth weight below 2.5 kg (LowWeight), the proportion of children born with some congenital anomaly (Anomaly) and the proportion of mothers who made seven or more prenatal visits during pregnancy (Prenatal). The MHDI was collected from the Atlas of Human Development in Brazil (2010) and the other four covariates were obtained from the DATASUS repository, maintained by the BMH.
To define the clustering structure, we consider the adequacy index (AI) introduced by França et al. (2006) as a measure of the quality of the infant mortality data collected in Minas Gerais. Based on the adequacy index, França et al. (2006) partitioned the microregions into four categories of data quality. We consider these four groups to analyze the ENM data in both periods, 1999-2001 and 2009-2011. Since data reporting quality is expected to have improved in recent years, the K = 4 clusters induced by this partition may be more heterogeneous in the period 1999-2001. In order to provide a sensitivity analysis, and also attempting to reduce the effect of within-cluster heterogeneity, we divide each of the previous groups into two new groups, obtaining another clustering structure with K = 8 categories of data quality. The median of the AI within each of the four initial groups is used to define the new partition into eight groups. Panels (b) and (d) of Figure 2 display the groups defined in both cases (each color corresponds to a different group).

About prior elicitation
To complete the model specification, a prior distribution must be elicited for each parameter, with special attention to the informative prior needed for the parameter γ_1. According to experts' opinion, the reporting probability in areas experiencing the best data quality is likely to approach one. Based on the information obtained from the specialists (local epidemiologists and health researchers) for both periods of interest, we adopt the conditional uniform prior distribution given in expression (2.6), eliciting an informative prior distribution only for the parameter γ_1 (partially informative prior distribution). When considering the clustering structure with K = 4 data quality groups, we set γ_1 ~ U(0, 0.10) for the period 1999-2001 and, as an improvement in data reporting quality is expected in more recent years, γ_1 ~ U(0, 0.05) for the period 2009-2011. When fitting the data with K = 8 clusters, we set the prior γ_1 ~ U(0, 0.05) for both periods.
To model the prior uncertainty about the relative risks, θ, we assume the structure of Model 3 described in the simulation studies (Section 3). We set β_0 ~ N(0, 100) and perform variable selection by eliciting the SSVS prior given in expression (2.8) for β = (β_1, …, β_5), with σ²_slab = 100, σ²_spike = 0.001 and ρ_m = 0.5, m = 1, …, 5. For the parameters s, u, σ²_s and σ²_u, we assume the prior distributions elicited in the simulation studies (Section 3). Also, for the MCMC performed in OpenBUGS, we consider the same specifications as in the simulation studies. The complete dataset and the BUGS code considered in this case study are provided in the Supplementary Material (Oliveira et al., 2020). A similar pattern occurred for the other areas, showing that an improvement in the data reporting process spread across the state. For the areas classified in the best data quality cluster, the estimated reporting probabilities tend to be close in both periods, which is expected since the posterior estimate of ε_i in the best group is strongly influenced by its prior mean (see the discussion in Sections 2.1 and 3.2).

Posterior results
Posterior estimates for the relative risks under the standard Poisson model are displayed in Panel (e) of Figures 2 and 3. For the period 1999-2001 (Figure 2), these estimates indicate that areas in the North and Northeast regions of Minas Gerais experienced the lowest ENM risks, smaller even than the risk obtained for Belo Horizonte, the state capital. This finding goes against the results of some epidemiological studies that relate data quality to socioeconomic and health care access indicators (e.g., Campos, Loschi, and França (2007)). Because the North and Northeast regions are the poorest and present the lowest socio-educational indicators in the state, experts believe that the ENM risks in such areas are much higher than those estimated by the standard Poisson model, evidencing the inability of this model to account for a high level of underregistration. For the most recent period, 2009-2011 (Figure 3), the spatial distribution of the posterior estimates provided by the standard Poisson model is more compatible with what the specialists expect: the posterior estimates of the ENM risks in the poorest areas (North and Northeast) are higher than those obtained for the more developed regions of Minas Gerais. This points to an improvement in the quality of the data reporting process, as indicated by the estimates of the reporting probabilities obtained under the proposed model in both periods. Moreover, compared with the estimates for the period 1999-2001 (Figure 2), the ENM risks for most regions in the South and Southwest of Minas Gerais decrease by 2009-2011. These results possibly reflect advances in socio-economic conditions and access to health care in Minas Gerais. Figures 2 and 3 show that the proposed model provides estimates for the ENM risks in Minas Gerais that are more compatible with the findings of Campos, Loschi, and França (2007), especially in the northeastern areas, for both periods.
Its performance is especially good when estimating the ENM risks in the period 1999-2001, in which data quality is more questionable. By accounting for underreporting, the proposed model corrects at least part of the underestimation experienced by the poorest microregions of the state, providing more realistic estimates of the ENM risks in such areas. As expected, for areas experiencing good data quality, the estimates under the proposed and the standard Poisson models are similar. As observed for the standard Poisson model, the maps of the ENM relative risks estimated under the proposed model in the period 2009-2011 disclose a decrease in risk for most microregions in the South and Southwest of Minas Gerais compared with the period 1999-2001. Table 7 summarizes the results under the fitted models. The log pseudo-marginal likelihood (LPML) criterion (Ibrahim, Chen, and Sinha, 2001) indicates that the data from 1999-2001 are better fitted by the proposed model with K = 8 data quality clusters, whereas for the period 2009-2011 the proposed model with K = 4 provides the best fit. The expected improvement in the quality of the data reporting process in the most recent period, 2009-2011, makes the microregions more homogeneous with respect to this data feature, so a smaller number of data quality categories is indeed expected. For each period, only results related to the best fitting models are considered in the following analysis.

Panels (a) and (c) of
Assuming that a covariate X_m, m = 1, …, 5, should be included in the model whenever ω̂_m ≥ 0.5, where ω̂_m denotes the posterior estimate of the associated inclusion probability, Table 7 shows that different sets of covariates are significant to explain the ENM risks in the two analyzed periods. Under the best models, only the covariate MHDI is significant (likely non-zero effect) to explain the ENM risk for the period 1999-2001 while, for the period 2009-2011, MHDI, Anomaly and Prenatal were significant. As expected in practice, the effect of the covariate MHDI is negative in both periods, indicating that the higher the MHDI, the smaller the ENM risk; this effect is smaller in the period 2009-2011. Also for this most recent period, we observe that the ENM risk is smaller in areas with a high proportion of mothers who made seven or more prenatal visits during pregnancy. Furthermore, the positive effect associated with the proportion of children born with some congenital anomaly (Anomaly) indicates that this characteristic has been an important factor in the occurrence of early neonatal deaths in recent years. The covariates MomEduc and LowWeight, usually pointed out as important factors to explain infant mortality rates, are not significant in the best model for either period considered in our study. In closing, it is important to mention that the relative risk estimates provided by the proposed and the standard Poisson models are closer in the period 2009-2011 than in the period 1999-2001. This is evidence of an improvement in the quality of the ENM data recorded in the civil registration systems SIM and SINASC in Minas Gerais State.

Discussion
We presented a novel Bayesian modeling framework to analyze potentially underreported count data. We proposed a clustering scheme that relates the reporting probabilities among the areas according to a previous data quality partitioning. Auxiliary variables and experts' opinions can be used to assess data quality across the areas. One interesting feature of the proposed model is that, to ensure its identifiability, only an informative prior for the underreporting probability in the areas experiencing the best data quality is required. This is attractive because, in such areas, information about the reporting probability tends to be more easily obtained.
Naturally, some care should be taken, as the posterior inference tends to be highly influenced by the prior specification for the parameter γ_1, the underreporting probability in the best areas. In the simulation experiments, a sensitivity study involving different levels of prior information for γ_1 was performed. The results indicated that if the specified prior mean for γ_1 is far from the truth, then the bias correction is likely to be inaccurate. Therefore, in practical situations, it is essential to seek reliable information on which to base this particular prior distribution, especially its prior mean.
Our model was applied to correct the underreporting bias in a Brazilian neonatal mortality dataset. In this case, previous works guided the partitioning of the region according to the data quality experienced by its microregions. It is worth mentioning that, in other case studies where a clustering structure is not previously available, one can apply standard clustering techniques to define the groups based on covariates related to the quality of the reporting system. In our application, local epidemiologists and health researchers provided information about the reporting process in areas where data are known to be better recorded. This information was used to elicit the required informative prior distribution for γ_1. A different prior specification in the neonatal mortality application might result in different inferences about the reporting probabilities and, consequently, a different bias correction for the mortality relative risks. However, such subjectivity is not unique to our solution for completely underreported data. In Bailey et al. (2005), for example, a different choice for the threshold used to define the censored areas can lead to different predictions. The same may be observed in the model introduced by Oliveira, Loschi, and Assunção (2017) if a different informative prior is elicited for the censoring probabilities. Likewise, in the approach proposed by Stoner, Economou, and Drummond (2019), a distinct prior specification for the mean reporting rate could lead to quite different posterior inference. The use of a complete validation dataset (as in, e.g., Whittemore and Gong (1991); Stamey, Young, and Boese (2006); Dvorzak and Wagner (2015)) might be a less subjective approach, depending on the quality, quantity, and experimental design of collecting such data. In many cases, as the one analyzed here, the elicitation of an informative prior distribution for a single parameter is a feasible and reasonable solution.
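One common way to translate such expert information into an informative prior for γ_1 is to elicit a Beta distribution from a stated prior mean and an "effective sample size" expressing the experts' confidence. A sketch under that assumption (the function and the numbers below are illustrative, not those used in the application):

```python
def beta_from_mean_ess(mean, ess):
    """Beta(a, b) hyperparameters with prior mean `mean` and
    concentration a + b = ess (an 'effective sample size').
    Larger ess yields a tighter, more informative prior.
    """
    a = mean * ess
    b = (1.0 - mean) * ess
    return a, b

# Example: experts judge that roughly 95% of events are reported in the
# best-quality areas, with confidence worth about 100 observations.
a, b = beta_from_mean_ess(0.95, 100.0)   # Beta(95, 5)
```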
The precise mapping of risks related to vital statistics is an important tool to guide health policies aimed at reducing events such as infant mortality. Estimates of the event reporting probabilities, which provide a measure of the severity of underreporting, help decide where additional resources for surveillance programs would be most necessary and effective. The model introduced in this work is another attractive tool to account for underreporting bias in this context.
An interesting topic for future research is to introduce partitioning models, such as Dirichlet process or product partition models, for underreported data. Such models would also allow the clusters themselves to be inferred as part of the estimation process.
Extensions of the proposed model should also consider situations in which there are spatial patterns in the reporting process. By borrowing strength from spatial modeling and extreme learning machines, Prates (2019) introduces a hierarchical model to perform imputation of missing count data; its use and adaptation to the underreporting context is another interesting point for further investigation. Although not approached in this paper, the modeling of underreported count time series has also been addressed in recent years, for instance, by Bracher and Held (2020) and Fernández-Fontelo et al. (2016). Another related problem that may interest readers is the estimation of animal abundance with differential probability of detection (see, e.g., Dorazio and Royle (2005); Hickey and Sollmann (2018)). In that context, hierarchical Poisson models are also used to model both the underlying process and the detection (reporting) probability.