Risk Models for Breast Cancer and Their Validation

Strategies to prevent cancer and diagnose it early when it is most treatable are needed to reduce the public health burden from rising disease incidence. Risk assessment is playing an increasingly important role in targeting individuals in need of such interventions. For breast cancer many individual risk factors have been well understood for a long time, but the development of a fully comprehensive risk model has not been straightforward, in part because there have been limited data where joint effects of an extensive set of risk factors may be estimated with precision. In this article we first review the approach taken to develop the IBIS (Tyrer–Cuzick) model, and describe recent updates. We then review and develop methods to assess calibration of models such as this one, where the risk of disease allowing for competing mortality over a long follow-up time or lifetime is estimated. The breast cancer risk model model and calibration assessment methods are demonstrated using a cohort of 132,139 women attending mammography screening in the State of Washington, USA.


INTRODUCTION
Unlike lung cancer (tobacco smoking) and cervix cancer (persistent infection with the human papilloma virus) where a single factor explains the majority of the cases, a large number of factors have been found to be important for determining the risk of breast cancer. An increased risk in women with a family history appears to have been known in ancient Roman times [58], and in 1842 Rigoni-Stern reported that nuns had an increased risk of breast cancer [52,23], paving the way for further research which established that not having a child or having a first childbirth at an older age increased the risk of this disease. Much of the modern work on risk factors was done by Macmahon and colleagues, mostly at Harvard [42]. Pike and colleagues emphasized the link with oestrogens and proposed that much of the population risk could explained by a computing the cumulative lifetime exposure to oestrogens based mostly on age at menarche, Centre for Cancer Prevention, Wolfson Institute of Preventive Medicine, Queen Mary University of London, Charerhouse square, London, EC1M 6BQ (e-mail: a.brentnall@qmul.ac.uk; j.cuzick@qmul.ac.uk). childbearing history and age at menopause [48]; an important meta-analysis and update on these and other factors was provided by Beral and colleagues [46].
In 1976 Wolfe discovered that breast density was another key risk factor that was broadly unrelated to the known classical factors, but of roughly equal importance in their combined predictive value [68]. This has been further developed by Boyd [11,10], McCormack and others [44]. In the same paper Wolfe also suggested [68] using breast density for risk assessment to determine how best to use mammography for breast cancer screening. Several decades later, advances in the ability to accurately stratify women into higher and lower-risk groups are likely to move early detection strategies towards this vision, and replace 'one-size-fitsall' screening with so-called 'risk-adapted' programs in which both the frequency with which a woman is screened and the modality are chosen based on the risk of breast cancer. Motivations include the identification of women at extremely high risk, who are potential candidates for risk-reducing surgery or preventive therapy [28], delineation of a group at moderately enhanced risk who might benefit from enhanced screening [59], and identification of populations at sufficiently low risk so as to require less frequency or even no screening.
A number of risk models have been developed, mostly for Caucasian women living in North America and Western Europe [31,17,4]. In this article we focus on risk assessment using the Tyrer-Cuzick model [62]. This is a hybrid of two popular sub-models often used for breast cancer risk assessment: a genetic segregation model for familial risk that is combined with a proportional-hazards regression model for other risk factors.
Breast cancer risk factors in the model broadly fit into five general categories: (1) family history and highly penetrant dominant genetic mutations; (2) factors associated with oestrogen exposure, including age at first childbirth, age at menopause and menarche (beginning of periods), use of hormone replacement therapy, height and weight; (3) certain types of prior benign breast disease, (4) breast imaging features seen on the mammogram -notably the amount of dense tissue (opaque areas on a mammogram or breast x-ray); and (5) common but individually less penetrant genetic differences (single nucleotide polymorphisms, SNPs), where several hundred relatively common genetic variants that individually have a small impact on disease have been identified, and that jointly make an important contribution to overall risk assessment via a 'polygenic risk score' [43]. There are also some other apparent risk factors which are harder to quantify, but which may improve the performance of the model.
In the rest of this paper we outline the statistical basis of our model, review and develop methods to assess model calibration that include competing mortality (which has been handled differently in different papers), and apply our model and calibration assessment methods to data from a large cohort of women attending mammographic screening in Washington, USA.

Background
To provide an assessment of the risk of disease occurring within a person's residual lifetime it is important to take into account competing risks that could lead to death from other causes. In a classical 'latent lifetimes' framework [19] each individual is subject to m potential causes of death with times T j (j = 1, . . . , m) and we observe their actual time of deathT = min(T 1 , . . . , T m ) and the specific cause J = 1, . . . , m. This framework assumes that each death is attributable to a single cause, or a defined group of causes. We extend this approach to also include the incidence of breast cancer, which is our principle interest, and do not consider death after breast cancer occurs.
Multiple modes of failure are conveniently characterised using functions for the rate at which each cause J occurs at each follow-up time, given that the person has not yet died, or experienced a specific event of interest. More precisely, this cause-specific hazard for j = 1, . . . , m is defined as Equation (1) is estimable due to conditioning on (T ≥ t). Another measure of risk that is sometimes considered is the proportion of the population who have event J when followed up to time t. This cumulative incidence function is defined by where is the marginal (all-cause) survivor function forT . When there is just one cause (death, say) then (2) is the cumulative distribution function (for time of death).
Breast cancer risk models within this framework typically consider two causes (m = 2): T 1 , the time to diagnosis of breast cancer and T 2 , the time to death from other causes (excluding breast cancer mortality, which can't occur before the time of diagnosis), so thatT = min(T 1 , T 2 ). Risk assessment for a women usually takes place when current age t 0 > 20 years, and the aim is to assess the absolute risk of breast cancer between age t 0 to age t, conditional on q risk factors x = (x 1 , . . . , x q ). This may be done by extending (2) to condition on risk factors x and age at risk assessment t 0 through: where h 1 (t | x) is the conditional hazard of breast cancer at age t (c.f. equation 1) and h 2 (t | x) is the conditional hazard for competing mortality. In practice, and following [31], competing mortality has often been taken to depend only on age and has been calibrated using national mortality statistics excluding breast cancer incidence. This is reasonable for breast cancer, as it is not strongly linked other causes of death, but is more problematic, e.g. for lung cancer, where tobacco smoking is a major factor for it and other causes of death such as cardiovascular disease.
Breast-cancer specific hazards have been estimated in two ways: 1. Regression models. Risk is frequently calculated by combination of a regression model derived from case-control or cohort studies of specific relative risks combined with absolute population-based incidence rates from cancer registries. An early and widely used example of this is the Gail model [31], which has been further developed in the Breast Cancer Surveillance Consortium (BCSC) model [15,60,45]. 2. Estimation from family pedigree data of the probability of carrying one or more high-risk mutations using segregation analysis, and then using the penetrance of a mutation to alter age-specific risk [17]. Two examples are BRCAPRO and BOADICEA [47,4].
A novel aspect of the Tyrer-Cuzick model is that it employs both methods [62]. The segregation model estimates the hazard function h G (t | x 1 ) at age t due to genetic factors, conditional on information (denoted x 1 ) about a woman's family history (family tree) of breast and/or ovarian cancer, and results of any tests in the family from known highly penetrant breast cancer genes such as BRCA1 and BRCA2. This is then combined with a relative hazard regression function r(x 2 ) > 0 based on other risk factors (x 2 ) through We divide family history into two factors 1 } corresponding to (a) breast and (b) ovarian cancer. The vector x (.) 1 has components for event time, censoring information and family relationships, and x 2 is based on the other risk factors shown in Table 1. Risk models that only account for family history based on a segregation model do not use x 2 . Regression-only models developed are often proportional-hazards models of the same form as (4), but with age-specific population rates instead of h G (.); partial information on family history of breast cancer can be included in the regression function, e.g. by including a covariate for the number of affected first-degree relatives.

Statistical model
The genetic risk breast cancer is modelled through estimates of mutations in BRCA1 and BRCA2 genes and an unknown dominant gene. Some of the genes which make up the unknown component have been discovered [27], but most have not, and are only inferred. Even the known ones are rare and individually do not make a major contribution, and are not so commonly tested for even in women with a family history. We use the variable c 1 for BRCA1/2 status (0 if not a BRCA1/2 carrier, 1 if a BRCA1 carrier, and 2 if BRCA2 carrier; joint carriers are extremely rare and are modelled as BRCA1 carriers). The unknown dominant gene is denoted c 2 (0, 1 if respectively not a carrier or a carrier). The model makes the following assumptions.
A.i Breast x where p(.) is general notation to denote a probability density or mass function.
A.ii c 1 and c 2 are independent: {p(c 1 , c 2 ) = p(c 1 )p(c 2 )}. A.iii The unknown gene c 2 is not associated with ovarian cancer [p{x A.iv The penetrance S G (t | c 1 ), i.e. the probability of developing disease by age t, is known for breast and ovarian cancer in BRCA1/2 carriers. The prevalence p(c 1 ) of BRCA1, BRCA2 and the unknown gene are known in the population. A.v A proportional-hazards model is assumed for the incidence associated with the unknown gene using two parameters θ = (β, γ) through where γ quantifies the log relative hazard attributable to the unknown gene, β is its prevalence (the per-allele proportion is 1 − √ 1 − β), and S 0 (t | c 1 ) is a baseline survivor function. From assumption (A.iv) and equation (5) the baseline survivor function is obtained by solving e.g. via a Newton-Raphson iteration [62]. A.vi Mendelian inheritance, i.e. there is a 50% chance that one of the two alleles from each parent is inherited by their offspring.
From these assumptions the breast cancer risk conditional on x 1 is obtained as where the weights p(c 1 , c 2 | x 1 ; θ) in (7) are obtained from the following application of Bayes' rule. Denoting d = (c 1 , c 2 ) then p(d; θ) uses assumptions given in (A.ii, A.iv, A.v), and p(x 1 | d; θ) is obtained following A.vi (described in more detail in [62]), first separately for breast x When gene testing has been done for BRCA1/2 the values for c 1 are taken from this, and the familial contribution only applied to the unknown gene c 2 . Most of the computer code in the Tyrer-Cuzick algorithm is involved in calculating the likelihood p(x 1 | d; θ).

Parameter estimates
Some changes to the segregation model have been made since the original model [62], starting from version 7 onwards. BRCA1/2 prevalence and penetrance are now from the 1950+ birth cohort estimates reported in [4]. The penetrance estimate is now lower than used in original model [62] (previously taken from [30]), which was attributed in [4] to bias in the original estimate [30] resulting from a focus on families with multiple cases of breast cancer. Prevalence of BRCA1 is taken to be 0.06% and for BRCA2 is 0.10% (previously 0.11% and 0.12% respectively). Smoothed first breast cancer rates from the Thames Registry 2005-2009 were used to calibrate overall risk, based on data reported by 5y age group and smoothed by loess (with the smoother 'span' parameter 0.2 chosen by eye). The unknown gene calibration has not been altered since the model was first introduced (β = 11.4% and exp(γ) = 13.04, fitted to data from [3], see [62]). The impact of the unknown gene depends mostly on the product βexp(γ) and it is difficult to accurately estimate these two parameters separately. The best fit gave a very large value for exp(γ) but almost equally good fits would arise if γ was substantially reduced and β increased accordingly. Attributing risk to a single gene is a simplification, as risk is likely to reflect a combination of genes. However it is clear that the contribution of this unknown gene is greater than that for BRCA1 or BRCA2, any other known gene or combination of single nucleotide polymorphisms and treating this either as a single unknown gene or a polygenic risk score has minimal impact on its predictive value.

Model
The relative hazard r(x 2 ) in model (4) is normalised by the mean risk in the population. This requires information on the relative hazard associated with other non-familial (personal, hormonal and lifestyle) risk factors {denote this function φ(x 2 ) > 0} and their population prevalence {f (x 2 ) where f (x 2 )dx 2 = 1} in order to obtain the normalised relative hazard In the Tyrer-Cuzick model this is approximated by treating each risk factor as independent (breast cancer risk factors in the model are largely independent of each other) and taking When the value of a risk factor is unknown it can be left blank and the normalised relative hazard associated with that factor is taken to be 1 (woman assumed to have population risk). The mean risk constants used in the model are shown in Table 1.

Parameters
The relative hazards and prevalence of risk factors in the original model were reported in [62], and included in Table 1. There have been changes and the current values are described below.
Atypical hyperplasia or LCIS The original model treated atypical hyperplasia or lobular carcinoma insitu (LCIS) as an independent risk factor. Evidence in [25] and [9] indicated that other risk factors should not be included when a women has been diagnosed with atypical hyperplasia as it is more of an intermediate endpoint. As a result we modified the model to take the maximum risk of atypical hyperplasia (only), LCIS (only) and family history risk combined with other risk factors (based on the 10y risk). For example, a young BRCA1/2 carrier with atypical hyperplasia would not have her risk assessment modified by atypical hyperplasia. This is likely to be conservative, but currently there is inadequate data to model the joint effects of atypia and other risk factors accurately.
Unknown benign disease In general the risk of subsequent breast cancer associated with a benign lesion depends on the histology, with no increased risk associated with non-proliferative lesions, about a two-fold risk associated with hyperplasia of the usual type and about a 4-fold risk associated with atypical hyperplasia [26]. Based on prevalence data reported by [26], the relative hazard from benign disease when a women has had a biopsy but pathology is unknown is taken to be 1.3; this option was not available originally. Non-biopsied lesions are not considered in the model. Menopause hormone therapy The model assumes a maximum relative hazard of 2.0 for combined therapy with an oestrogen and progestin and 1.4 for oestrogen-only hormone therapy. The relative hazard begins at unity in year 1, is half the maximum excess relative hazard in year 2 and then the maximum relative hazard therafter until stopping therapy. Risk is then ramped down following cessation to be 2/3 of the maximum excess in the first year after stopping, then one third in the next year, and no increase thereafter.
The relative risk is adjusted if BMI (body mass index) is known, being decreased by 10% of the average excess risk if obese (BMI ≥ 30 kg/m 2 ), and increased by 10% if BMI <25 kg/m 2 (not overweight or obese). These assumptions are based on results from the Women's Health Initiative trial [16]) and the Million Women Study [50].

Mammographic density
Breast density is correlated with some of the other risk factors in the model, most strongly age and body mass index. It is also measured in different ways that require different calibration. We allow three different ways to input density: a visual assessment in which (ideally) two readers estimate the percentage of the breast covered by dense (opaque) tissue and the average is used, the BI-RADS system which corresponds to four categories, and an automated volumetric system (Volpara) that estimates the percentage volume of dense tissue based on the radiographic intensity of each pixel. In order to incorporate breast density into the model without changing the effect sizes of the other factors, we first developed a measure of breast density which is independent from age and BMI at mammogram, by taking the difference between observed and expected density given age and BMI. This was developed for the different measures in case-control studies, where expected density was modelled in controls by fitting a generalized additive model [69] of different breast density measures against splines for age and BMI (see Figure 1 for a fit to volumetric percent density). The risk associated with density was then calibrated by estimating the effect of the density residual, adjusted for other factors in the Tyrer-Cuzick model using case-control and cohort studies [67,14,13].

Polygenic risk scores
Numerous studies have demonstrated the utility of polygenic risk scores. These use common gene variants which individually carry small risks but as a group provide useful information. Their utility has been observed in average risk women [65,45] and in high-risk women [21,29], and they combine effectively independently with classical risk factors and breast density [63,64]. Currently, the breast cancer association consortium has validated more than 170 risk-modifying SNPs at genome-wide significance [43], and this is likely to increase still further.
To provide an overall relative risk estimate from a collection of independent SNPs polygenic risk scores may be obtained by using published data on their per-allele risks, and allele frequencies. First the odds ratio for each of the three SNP genotypes is determined (no risk alleles, 1 risk allele, and 2 risk alleles) and these are normalised to obtain an average of unity using reported risk allele frequencies. In our model we assume the odds of two copies of the abnormal gene is the square of that for one copy, and their distribution follows the Hardy-Weinberg law. Finally to obtain an overall polygenic risk score the odds ratios for each of the genotypes are multiplied together assuming independence [45].
Data show very small correlations between polygenic risk scores and classical risk factors. We have observed that an accurate fit is obtained by treating the score as independent of the other factors in the Tyrer-Cuzick model [21,29,64]. However, theoretically, a polygenic risk score will explain some of the genetic aggregation modelled through the segregation model, and so this should be down weighted in the risk assessment. Work to refine this component is ongoing.

COMPARISON WITH OTHER RISK MODELS
It is important to understand differences between risk models, both for use of the models and to refine the scientific process of improving risk assessment. Here we note some differences between the Tyrer-Cuzick and other models.

Segregation models
Some segregation models, including BRCAPRO [47], calculate breast cancer risk only based on estimating the likelihood of carrying major genes based on a personal and family history of breast, ovarian or other cancers related to these genes. The main aim of such models is to estimate risk of being a BRCA1/2 mutation carrier. They are mostly used for women with a strong family history in order to determine eligibility for BRCA1/2 testing. They incorporate both pedigree information including age at onset and results of genetic testing for BRCA1 and BRCA2 in the women and their relatives. Such models are less suitable for risk assessment of breast cancer in the wider population, partly because they do not allow for other unknown genetic factors in the model.
The BOADICEA model accounts for other rare but highly penetrant genes (PALB2, CHEK2, ATM ), and uses a polygenic model to allow for (unobserved) risk due to other genetic factors than those measured [41]. Define c * 1 = 0 if not positive for one of the genes, if test positive for PALB2 (but not BRCA1/2 ) and so on for next CHEK2 and last ATM. Thus very rare individuals who test positive for more gene are assigned a group c * 1 = 0, . . . , 5 based on a hierarchy. The model has the following form for the breast cancer specific hazard: where t is age, h 1B (t) is a baseline hazard function, I(.) the indicator function, β j (t) are time-dependent relative hazards for each genotype, and u(t) is a timedependent random effect from a normal distribution with mean zero and standard deviation σ t . When the woman has not been personally tested, the probability of being a carrier is estimated based on the pedigree and testing results (if any) in family members. The random (polygenic) effects arise from a combined effect of several unmeasured and largely unknown genes, each of which contributes a small effect. Large scale genome-wide association studies have identified that the polygenic model for genetic susceptibility is partly correct, but a large proportion of familial aggregation of risk is still not explained by known major genes or polygenic risk [43]. For most pedigrees arising in the general population one might expect only small differences for risk estimates from polygenic vs single unknown gene models. Qualitatively, the unknown single dominant gene model is a good approximation to the polygenic model: the genotype is never observed and so gives rise to a range of effects through equations (7) and (8). Indeed, one study that compared the fit between the unknown dominant gene and polygenic models did not show an appreciable difference between them when considering additional genetic factors other than BRCA1/2 [5].
Another difference between BOADICEA and Tyrer-Cuzick is that the hazard for the unknown genetic effect γ is assumed to not vary with age in the Tyrer-Cuzick model, whereas the polygenic variance varies with age in the BOADICEA model. Thus the BOADICEA model leads to a larger decline in the relative risk as the age of the affected relative increases and appears to fit better with the literature. We will update this aspect of the model in the next version.

Regression function models
Several models have been developed based on a regression function only. These include the Gail (or BCRAT) [31], the Breast Cancer Surveillance Consortium (BCSC) [60,61] and Rosner-Colditz models [51,71]. Neither the Gail nor BCSC model include all the risk factors included in the Tyrer-Cuzick model, but they do account for ethnic differences. The Rosner-Colditz model includes the terms for the same broad risk factors as in the Tyrer-Cuzick model, and some additional factors including alcohol consumption, adolescent body somatotype (shape / size) and hormone levels. Assumptions on the risks and prevalence of risk factors differ between the models, as do the use of interaction terms.
The Tyrer-Cuzick model includes an interaction between menopausal status and BMI, assuming it is only a postmenopausal risk factor. Some data suggest that BMI may actually be a protective factor for premenopausal breast cancer but this is not included in the model [57]. Interactions are also included for BMI and HRT (as discussed above), and atypical hyperplasia (or LCIS) and all other risk factors in that it is treated as an intermediate endpoint, and the maximum risk for it alone or all other factors combined is used, following analysis from [9]. No other interaction terms are currently included.
The Gail model includes an interaction between (i) ethnicity and all other risk factors, and for certain ethnic groups an interaction between (ii) age and number of biopsies, and (iii) age at first child and number of affected first-degree relatives (negative interaction). For example, both latter interactions are included for white women, but they are not used for Asian women.
The BCSC model includes interactions between a woman's age with (i) ethnicity, (ii) benign disease, (iii) family history (negative interaction) and (iv) breast density (negative interaction), by allowing for different regression coefficients in 10y intervals starting at age 40y [61].
The Rosner-Colditz model allows for interactions between: (i) BMI and duration of premenopausal period (time from menarche to start of menopause) (ii) BMI and current duration of menopausal interval, (ii) alcohol after menopause and hormone replacement therapy, (iii) benign disease with age at menarche, (iv) benign disease with duration of premenopause, (v) benign disease with duration of menopause, (vi) height with duration of premenopause (vii) height with duration of postmenopause.
All the models (except Tyrer-Cuzick) include interactions relating to benign disease, but their forms are quite different. In the Gail model, for white women the relative risk of previous biopsies is less in women older than 50y than for younger women. In the BCSC cohort used to fit their model [61] this pattern was not observed for proliferative disease, and for non-proliferative disease the direction was even reversed (a larger effect for women older than 50y). In the Rosner-Colditz model there is a positive interaction with age at menarche and a negative interaction for the other terms (albeit none are univariately significant in the most recent model at a 5% level [51]), suggesting an older age at menarche is not protective for women with benign disease. These three models have been fitted in populations with individual-level data. Relative hazards for white women in the Gail model were estimated using a case-control study with 2842 cases and 3146 controls, including 179 cases with a prior biopsy aged <50y and 487 cases with a prior biopsy aged 50y+ [31]. The BCSC model was fitted to a cohort with more than 1 million women with up to 10y follow up, and included 6204 women with proliferative disease without atypia and 177 subsequent cancers [61]. A recent Rosner-Colditz model update included approximately 100,000 women of whom 5,246 had developed breast cancer. Benign breast disease (non-proliferative and proliferative) was reported in approximately 18,000 women at entry [51].
Potential interactions between benign disease and other risk factors have also been investigated in other studies, but also with inconclusive findings. For example, in a cohort from Nashville a negative interaction between non-proliferative disease and age was observed, such that women diagnosed with non-proliferative disease at an older age had lower breast cancer risks than women in the cohort without proliferative disease [26]; this has not been evaluated in any of the four models considered here.
Overall, due to the current lack of strong evidence to support an interaction of benign disease with age, no such interaction is included in the Tyrer-Cuzick model, and it assumes non-proliferative benign disease does not confer any increased risk.

Framework
It is important to be aware of the difference between using a model to project risks and calibrating a model from a cohort where follow-up data already exists. In the former case routine data only provides risk assuming no inter-current mortality, and a specific adjustment needs to be made for competing mortality (c.f. equation 3). In the latter case, inter-current deaths are known and can be treated as part of the censoring process. This has led to different methods for computing absolute cumulative risk, which are essential for interpreting the model predictions, and subsequently to assess how well the model is calibrated to the population under study. This has not been widely appreciated, and can lead to different estimated values of expected absolute risk from the same disease model. In this section we review some different methods for computing expected cumulative risk under the following setup.
Assume there is a sample of i = 1, . . . , n independent individuals with data on risk factors. The risk model evaluates each risk from current age t 0 to age t, defined as P x i (t 0i , t) or P i (t) using shorthand notation following equation (3). The time to breast cancer or death is subjected to right-censoring, so that one observesT = min(T 1 , T 2 , T C ), where T 1 is time to disease, T 2 is the time to death from other causes and T C is a right-censoring time.
An issue to contend with for computing the expected probability of breast cancer is that the 'at-risk' interval should be from t 0 to T C , whereasT is often used. T C is not observed when min(T 1 , T 2 ) < T C and this presents difficulties as its (conditional) distribution can be hard to determine. Calculating the expected risk using probability of breast cancer over (t 0 ,T ) is liable to underestimate expected risk. A solution is to use lifetable methods which are based on the sub-hazard of breast cancer and not the probability of breast cancer, but only need to be computed over (t 0 ,T ). We next consider these different methods in more detail.

Expected risk based on the cumulative disease-specific hazard function
The expected number of breast cancer cases in the cohort may be obtained by first integrating breast cancer specific hazards over the observed follow-up period, to obtain each individual's cumulative hazard for breast cancer during the period she is at risk. Formally, let Y i (t) = I(T i > t) denote the at-risk process for individual i = 1, . . . , n, and h 1 (t | x i ) = h 1i (t) the disease-specific hazard function sub-model. Then the expected number of individuals with disease (J = 1) is Equation (11) may be adapted to compute the expected number of deaths from other causes by replacing h 1 with h 2 , etc. Indeed, by letting h 2 reflect all-cause mortality (including breast cancers) this is equivalent to the time honoured lifetable method to compute the expected number of deaths [39]. Thus while it might seem surprising to base the expected number of women diagnosed with breast cancer on the disease-specific cumulative hazard rather than the smaller estimate of the probability of breast cancer, this method is firmly established in lifetable and other settings (e.g. the log-rank test).

Expected risk based on cumulative incidence
The previous section is the preferred method (see below) but another approach is to calculate expected risk using the model probability of disease allowing for death from other causes (cumulative incidence). This is most straightforward when the model is used in a cohort with no censoring from starting age t 0 up to age t. In this case the expected number is a summation of the conditional cumulative incidence in (3): where P i (t) ≡ P x (t 0 , t) for the i-th individual. It is extremely uncommon to have such a dataset, and the calculation is more complicated under the most common scenario of right censoring because we need an estimate of P{J = 1, min(T 1 , We consider three types of censoring processes for prospective risk calculation.
Firstly, if censoring is fixed to be to a common follow-up time for each patient t i = t 0i + A, such as A = 5 years, then Secondly, when the censoring time T C is variable between individuals but it is deterministic, such as being due to variable calendar time enrolment and a common follow-up for all i then A third case is when the potential censoring time is unknown but is a stochastic process with known distribution. Then one possibility is to base expected risk on where S C is the survivor function for the censoring distribution. Then If S C is unknown then one will need to estimate it, for example by assuming censoring is independent of risk factors x and using the Kaplan-Meier estimator [40].

Calibration
The above methods based on cumulative disease-specific hazards or cumulative incidence may be used to assess calibration by comparison of the observed number of women with disease (O) with expected (E), both overall and in sub-groups [37]. If it is assumed that disease is rare (this assumption is relaxed in Section 4.7), then a test may be constructed assuming that O is generated from a Poisson distribution, whose rate would be E if the model was perfectly calibrated. Exact Poisson confidence intervals on O may be used to compute a confidence interval for O/E (treating E as fixed) and determine if it covers unity [32].

Some biased estimates of absolute risk and their effect on calibration
Some commonly-used methods are biased, but this can be overcome using hazard-based methods where the appropriate 'at-risk' interval is used. We next consider three methods that yield biased estimates of E, the expected number of cancers, which in turn leads to biased estimates of the observed to expected ratio O/E.

1.
Under right censoring some studies [18,24,7,54] have used cumulative incidence as the measure of expected risk, but estimated it via This measure of expected risk is biased towards zero because it only computes risk until the event timeT . For example, suppose we wish to compute the expected number of deaths in a sample of babies over the next 200y. If the mortality model is well calibrated but risk of death until the age at which each person actually dies is summated (as equation 17), then fewer deaths will be expected than the number of babies in the sample. The correct analysis based on cumulative incidence would summate risk of death by age 200y for each individual (c.f. equation 12). In mitigation, the bias will be small, and will have a minimal impact on the conclusions drawn when the event is rare. 2. In some studies it has been reported [2] that the expected number with disease is computed as i.e. summating 1 minus the net survival (the estimand for Kaplan-Meier estimation) rather than (11). In general this is biased towards zero for the expected number of breast cancers because the term in the summation is less than or equal to the cumulative hazard H 1i (T i ). However, this is a small bias if the cumulative hazard H is small since 1 − exp(−H) = H + O(H 2 ). Also note that 1−exp{−H 1i (T i )} ≥ P i (T i ), so is less biased than summating P i (T i ) (i.e. equation 17). 3. Finally, calibration of breast cancer risk has sometimes been assessed using a fixed follow-up time for all by only including those who could be followed up at least that long [60,53]. This in inefficient because some data would be excluded, and it should be used with caution when there is censoring. For example, suppose we seek to assess calibration of 5y risk when there is right censoring in the data, and include only those without breast cancer who were not censored by 5y and all breast cancer cases diagnosed up to 5y. If the model is correct (and censoring is non-informative) then O/E based on (12) will be biased. E will be too small relative to O because some of the non-cases will have been excluded from expected risk due to censoring, but they should have contributed to E as they would have been counted as cases if disease had occured in them.

An unbiased estimate of expected risk
To avoid bias we recommend using E (H) to obtain the expected number with disease. This is based on the cumulative hazard rather than probability of disease (cumulative incidence) and has several advantages compared to E (P ) when censoring has been handled appropriately as described in Section 4.3.
Firstly, comparing E (H) with O is a test of whether the disease-specific hazard model is correct. The same method may also be used to test the competingmortality model. Comparing E (P ) with O is a test of whether the model for cumulative incidence based on combining the disease risk and competing mortality is correct, i.e. whether the marginal risk based on (3) is well calibrated. This might require accurate knowledge and data on the competing risk process, which is often not available. It is uncommon for the competing mortality model to be as well developed as the disease risk model, which is of primary interest to the epidemiologist. If the competing mortality model is incorrect, but the disease model is correct, then the expected risk will be incorrect due to an inadequate model for competing mortality. Conversely, even if the cumulative incidence function risk is well calibrated, it is possible that the competing mortality and disease model are both wrong in different ways, and just happen to cancel each other out. Comparing the hazard sub-models separately would identify such issues.
Secondly, when there is stochastic censoring equation (11) is easier to apply than (16) because it does not require an assumption or estimate for the conditional censoring distribution.
Thirdly, it can be a more powerful test. For example, in an extreme example of 200y follow-up on mortality in babies, then one would not be able to show that a model that assigns a constant and excessively high mortality rate was incorrect; whereas analysis based on summating the cumulative hazard of that model to the time of each death would yield an expected number of deaths in excess of observed.

Extensions
We end this section by describing methods to assess calibration not only for the total population, but also in subgroups. For example, when the data are split into ten groups by decile of the predicted risk at baseline then a test of calibration (more generally model fit) across the groups would be akin to the Hosmer-Lemeshow test for binary outcomes [35,37]. To extend this strategy to the cumulative incidence framework, we construct k = 1, . . . , K distinct groups, and assess whether the observed risk matches expected in each of the groups. Then, the test statistic where the expected E k are obtained as above, will be approximately χ 2 with K − 1 degrees of freedom under the null hypothesis of correct prediction and calibration. This hypothesis testing approach has limitations. Firstly, power to reject inaccurate calibration is limited by the large number of degrees of freedom. Secondly, with a large enough sample size we would expect to reject model calibration be-cause all models are wrong, and the approach does not directly indicate where the inaccuracies lie.
An alternative approach is to view testing calibration as special case of Poisson regression. Let O i denote whether individual i = 1, . . . , n in the sample has disease, and E i be the expected risk based on the model. Then consider a Poisson regression for where θ = exp(γ 0 ) and γ 1 are unknown parameters. Setting γ 1 = 1 then the maximum likelihood estimateθ = i O i / i E provides an overall 'calibrationin-the-large' parameter for the entire cohort [20]. When γ 1 is also estimated then it can be used to test for calibration across the range of expected risk (on 1 df), as well as yielding an estimate of how closely the regression line matches observed (γ 1 = 1 being ideal; 0 being not at all).
Finally, we consider when the disease is relatively common, perhaps because focus is on a high risk group with long follow up. In this case the Poisson distribution no longer approximates a binomial distribution. However, Poisson regression may be valid if time is broken into shorter segments where the rare disease assumption holds, for example by year, with each segment treated as a 'new' observation in the Poisson regression analysis. This approach is also useful when assessing time-dependent calibration, and which we proceed to consider next.

TIME-DEPENDENT CALIBRATION
The focus of many studies has been on whether the cumulative number with disease over a single follow-up period matches the expected number. While important, it does not assess differences through time, such as non-proportionality of the observed to expected risk. In this section we consider techniques to assess calibration across follow-up time, and develop new graphical methods.

Methods
One approach is simply to look at the observed and expected number of events N j (t) for each cause j as a counting process. The observed number iŝ The expected number, based on the cumulative hazard approach (as Section 4.2), is so that we have extended (11) to be a function of t. One may compare observed N j (t) vs expected from (20) through follow-up time t, such as is sometimes done to check proportional hazards [6,33]. Cumulative pointwise (such as from entry to time t), or interval (such as number in a year) confidence intervals may be constructed using exact Poisson confidence intervals, or used to undertake hypothesis tests of calibration up to or near time t.
An issue with the comparisons based on N (t) is that they depend on the sample size and censoring distribution, so cannot be readily used to compare between studies. To circumvent this one might consider the comparing the mean cumulative hazard among those at risk, i.e. for t < max i (T i ) comparinĝ with its expected value where the total number at risk at time t is Y + (t) = n i=1 Y i (t). Note that equation (21) is the Nelson-Aalen estimator of the cumulative hazard [40]. Equation (22) is the expected cumulative hazard based on the breast cancer risk model and observed at-risk process. Time-dependent assessment of the observed to expected risk can be based onĤ j (t) vs H j (t) across follow-up time t. Here we treat H j (t) as fixed and use the usual confidence intervals for the Nelson-Aalan estimator H j (t) at different times, from which the hypothesis of calibration at a particular time t can also be assessed.
Other functionals ofĤ j (t) and H j (t) might also be used for assessment of calibration across time. Arguably the most common functional used in applications is the 'net' risk which is often estimated as 1 minus the Kaplan-Meier estimator [40]. There are two options for calculating the expected risk. The first is a mean from the entire cohort at baseline The second is to use the mean hazard in those still at risk at time t through and (22).
A disadvantage of the cumulative hazard (22) and net risk (23,24) is that interpretation of the component terms is not straightforward in relation to the observable risk in equation (2). However, if we weight (21) by the proportion at risk using the Kaplan-Meier estimatorŜ(t) for all-cause survival (i.e. the survival endpoint is disease or death), then we recover an estimator of the cumulative incidence function througĥ and inference may be based on estimates of its variance [40]. Expected risk may be obtained from (26) or the mean expected risk at each t over the n individuals.

Summary
Our recommendation is to focus on comparisons only involving the diseasespecific hazard to assess O vs E through follow-up time t. For exploratory plots we prefer use of the cumulative hazard comparison of (22) with (21). There are several reasons for these recommendations.
Firstly, as for overall calibration, the method may be used to assess the disease risk and competing mortality sub-models separately. The cumulative hazard analysis extends quite naturally a total number of events comparison, but avoids dependence on the censoring distribution.
Secondly, focusing on the hazard leads to using the observed at-risk process to calculate expected risk, which appears to be a more robust method to assess model calibration than only using a risk assessment estimate on everyone at baseline. For instance, if those censored or lost to follow-up are more likely to be at higher predicted risk at entry, then estimates of their expected risk will differ, and so comparison of expected risk between methods based on using the at-riskprocess or everyone at baseline is a way to assess whether the censoring process was associated with predicted risk. In general terms this issue has parallels with relative survival, where one might consider the Ederer-I approach (mean risk) or the Ederer-II approach (use the at-risk process) [55].
Thirdly, a statistical advantage of the cumulative hazard compared to the cumulative incidence function is that it may have a smaller variance because it does not require an assumption or estimation of all-cause survival (equation 25).
Finally, in the exploratory plots when the aim is to assess calibration of a risk model, then differences between the observed to expected cumulative hazard seem easier to interpret than the other methods. In contrast, the cumulative incidence function weighs the cumulative hazard by all-cause survival. The component terms have a natural interpretation, but the observed to expected comparison is perhaps less easy to interpret for assessing calibration than cumulative cause-specific hazards, because changes could be due to either lack of calibration in the disease or competing mortality models. Kaplan-Meier estimation transforms the cumulative hazard through 1 − exp(−H), so that differences between large cumulative hazards are downweighted. This makes changes in the calibration plot with time less easy to relate to calibration at the hazard level. For example, it is easier to assess whether the hazard function is constant through time by plotting the Nelson-Aalan estimate of the cumulative hazard, than by plotting 1 minus the Kaplan-Meier estimator.

Data
We have previously reported an evaluation using the Tyrer-Cuzick model in a cohort from Washington state USA [13]. Between 1996 and 2013, 132 139 women aged 40-73y completed a risk questionnaire and had a measure of breast density taken at entry. They were followed up beginning 6-months after the entry mammogram to the earliest of diagnosis of invasive breast cancer or censoring. Women were censored due to disenrollment (n=62 331, 48.2%), end of follow-up (n=48 317, 37.3%), age 75y (n=15 827, 12.2%), death from other causes (n=2328, 1.8%) or ductal carcinoma in situ (n=637, 0.5%). Only aggregate data on competing risk causes were made available for analysis, so we were unable to apply methods that require individual-level data about the cause of censoring.

Calibration
A total of 2699 breast cancers were observed (O), and based on our preferred method (equation 11) 2757 were expected (E), yielding O/E 0.98 (95%CI 0.94-1.02). We next consider application of incorrect methods discussed in section 4.5.
A first biased assessment is cumulative incidence over a 5y fixed-time horizon by using all the cases that were diagnosed in the 5y period and all non-cases at risk for 5y or more. It we calculate the expected 5y risk in those who are at risk at 5y then E = 877 against an observed O = 1157 (O/E = 1.32). A second biased method is to summate expected cumulative incidence to the last follow-up for each individual (equation 17), yielding E = 2605 and O/E = 1.04 (95%CI 1.00 to 1.08). A third biased method is to follow equation (18) and summate the absolute net risks, which gives expected number E = 2679 (O/E = 1.00).
It also instructive to repeat the above methods in a high-risk group (> 8% ten-year net risk) where greater differences between the methods to calculate E will occur. The Tyrer-Cuzick model with mammographic density identified 4645 women to be in this group, and 273 breast cancers were subsequently diagnosed. Based on the preferred method (11) we find E = 349, so that O/E = 0.78 (95%CI 0.69-0.88), indicating that the model over predicted the high risk group. If instead (18) was used then E = 324 and O/E = 0.84 (95%CI 0.75-0.95). Based on (17) we find E = 310 so that O/E = 0.88 (95%CI 0.78-1.00), where the 95%CI covers unity. Thus the correct analysis showed a lack of calibration that would not be seen so clearly with the biased methods.
This analysis shows the method to calculate expected risk can cause practically important differences in interpretation of risk model calibration, particularly for high-risk groups.

Time-dependent calibration plots
Time-dependent calibration was assessed using the methods in Section 5. We present results from overall model calibration ( Figure 2) and in the highest and lowest risk deciles ( Supplementary Figures S1 and S2) following [13].
Overall the model appeared well calibrated throughout follow-up time. As follow-up started at six months, the expected hazard in the first year was much lower than the second year of followup. Figure 2a shows the model did not adequately take into account the effect of removal of a pool of cancers diagnosed at entry. However, subsequently the model tracked the observed number quite closely. Figures 2c,d show respectively the Nelson-Aalan and Kaplan-Meier curves. The corresponding observed to expected plots in Figures 2e,f are virtually identical using the expected risk based on those still at risk. There is also very little difference between the two methods to obtain expected risk using the Kaplan-Meier approach, which indirectly confirms that censoring was unrelated to risk assessment.
The same plots were used to assess high and low risk groups. Supplementary Figure S1 considers women who were in the highest predicted risk decile at entry. As noted in Section 6.2, there is some evidence of over-estimation from the model for this group. However, the observed to expected plots in Supplementary Figure S1e,f show a fairly constant pattern across time, with no discernible difference between Nelson-Aalan or Kaplan-Meier comparisons. Supplementary Figure  S2 shows calibration of the bottom decile, where calibration appears reasonable through follow up time, albeit with a suggestion that the model had a tendency to under estimate risks.

Regression analysis
The above analysis suggests that the risk model was calibrated overall in the cohort, but tended to overestimate risk in those predicted at highest risk; it over-estimated risk in the first year due to omitting screen-detected cases, but the risk predictions thereafter were stable with greater follow-up time. All these aspects may be jointly tested through a Poisson regression model. The model was fitted with an exponential link function. The log hazard per year (or until the event or censoring if in that year) was included as an offset term, and calibration coefficients were estimated for (i) overall lack of calibration in the first year, (ii) stability of calibration from year two onward (follow-up year), and (iii) 10y risk categories from the model as previously defined (<2%, 2 to <3%, 3 to < 5%, 5 to < 8%, ≥ 8%; with 2-3% group as the reference category) [13].
Results are shown in Table 2, taking the exponent of the estimated model parameter to give a calibration coefficient where unity indicates ideal calibration. They confirm the earlier analysis. The risk model over estimated risk in the first year by a factor of two (calibration coefficient 0.50, 95%CI 0.40-0.63), which is due to the expected short-term reduction in risk following a negative screen. There was no loss in calibration subsequently (calibration 1.00, 95%CI 0.99-1.01). The highest risk category (10y risk ≥ 8%) exhibited evidence of over-estimation of risk relative to the reference average-risk category (calibration overall 0.80, 95%CI 0.69 -0.92, i.e. including the intercept), and likewise there was some under-estimation in the lowest-risk group (overall test for calibration across the five groups likelihood-ratio χ 2 = 48.6, df=4).

CONCLUSION
In this paper we reviewed the statistical foundations of a model to evaluate breast cancer risk at different follow-up times. This model, and corresponding computer program, synthesise findings from many scientific studies by combining risk factors into a hybrid genetic segregation and regression model. The model has been used to guide entry criteria into prevention trials [22], and is recommended by several organisations to guide preventive and early detection strategies, including the American Cancer Society to help determine eligibility for supplemental breast magnetic resonance imaging [56].
We developed new graphical methods for looking at calibration over follow-up time that can be combined with Poisson regression analysis. The impact of using the different methods to assess the calibration of cumulative absolute risk was reviewed. Different methods to account for competing risks when assessing risk model calibration have been used. We note that, given the diversity of methods in use and potential impact on findings, it is important when presenting calibration findings that a detailed description of the method for expected risk is also given [1].
There are opportunities to refine and improve the current risk model. Other factors are known to affect the risk of breast cancer, but most have some difficulties for inclusion in a relatively simple model. Using weight change from age 20 or weight at a young age in addition to current weight does appear to improve risk assessment to some extent [36], but requires accurate recall of previous weight. Alcohol consumption is a well documented risk factor [34], but underestimation of consumption is well known and it is not clear how best to allow for this. The differing roles of weight in the pre-and postmenopausal periods also needs further investigation [57]. Physical activity and dietary factors are also known to affect risk [49,70], but identification of the key factors and simple accurate measures of these currently present difficulties. Levels of oestrogens and testosterone may also be important especially for postmenopausal women, but these require a blood sample and are not routinely available [71]. Ethnic differences in breast cancer risk are also well documented -notably in the US lower risks in Hispanics and Asians [8]. However it is unclear as to how these are related to differences in the known risk factors in the model, so would not need model recalibration, or whether there are intrinsic ethnic differences which require use of a separate baseline hazard function, and possibly separate risks for the known factors. Differences between recent migrants and established individuals are also known to exist and complicate any model adjustments [38]. Adding breast density has substantially improved the model, and might be extended by considering longitudinal measurements of density and features in the mammogram other than simply breast density, where more complex algorithms may be able to extract more risk information [66]. The polygenic (SNP) risk score is steadily being improved as more SNPs are added, but this is easily accommodated as all that is required is the relative risk. A bigger challenge is to accurately account for interactions between risk factors. Our model was developed by synthesizing risks for individual factors from large overviews of many studies, and large datasets which contain all the factors will be needed to fully explore interactions. These are not currently available but are likely to be so soon, as the risk factors used in our model are currently being collected in several large cohorts of patients.
In conclusion, we are gradually moving towards an era of precision medicine, where disease treatment, early detection and prevention strategies will depend on risk assessment. The accuracy of risk models will underpin this approach.