Quantification of the weight of fingerprint evidence using a ROC-based Approximate Bayesian Computation algorithm for model selection

For more than a century, fingerprints have been used with considerable success to identify criminals or verify the identity of individuals. The categorical conclusion scheme used by fingerprint examiners, and more generally the inference process followed by forensic scientists, have been heavily criticised in the scientific and legal literature. Instead, scholars have proposed to characterise the weight of forensic evidence using the Bayes factor as the key element of the inference process. In forensic science, quantifying the magnitude of support is equally as important as determining which model is supported. Unfortunately, the complexity of fingerprint patterns renders likelihood-based inference impossible. In this paper, we use an Approximate Bayesian Computation model selection algorithm to quantify the weight of fingerprint evidence. We supplement the ABC algorithm with a Receiver Operating Characteristic curve to mitigate the effect of the curse of dimensionality. Our modified algorithm is computationally efficient and makes it easier to monitor convergence as the number of simulations increases. We use our method to quantify the weight of fingerprint evidence in forensic science, but we note that it can be applied to any other type of forensic pattern evidence.


Introduction
For more than a century, fingerprints have been used with considerable success to identify criminals or verify the identity of individuals. In this paper, we define a fingermark (left panel of Figure 2), or mark, as the impression resulting from the inadvertent contact between the finger of an unknown donor and a surface (e.g., at a crime scene). We define a control print (right panel of Figure 2), or print, as a finger impression collected under controlled conditions from a known donor (e.g., a suspect). The purpose of forensic fingerprint examination is to support the inference of the donor of a fingermark. Currently, this inference process relies on the visual comparison between the fingermark and control prints from one or more candidate donors who may have been selected through a police investigation or by searching the fingermark in a database of prints from known individuals. Following the examination of fingerprint evidence, it is customary for the examiner to report one of two possible conclusions: an opinion of "identification", implying that the source of the fingermark is the donor of a given control print; or an opinion of "exclusion", implying that the source of the fingermark is not a considered candidate. Alternatively, the examiner may find the examination "inconclusive", indicating that the characteristics of the impressions being compared are not sufficient to reach one of the two possible conclusions (e.g., when the impressions are too small or too degraded).
The categorical conclusion scheme used by fingerprint examiners, and more generally the inference process followed by forensic scientists, have been heavily criticised in the scientific and legal literature (Cole, 2004, 2005, 2009; Kaye, 2003; Koehler, 2005a, 2008; Zabell, 2005). Instead, scholars have proposed to characterise the weight of forensic evidence using the Bayes factor as the key element of the inference process (see Aitken, Roberts and Jackson (2010) for a comprehensive discussion).
It is worth stressing that forensic scientists do not have a complete picture of all the evidence available in a given criminal case (e.g., eyewitness evidence; means, motive and opportunity of an individual of interest; other material and non-material evidence), which prevents them from assigning probabilities to the different propositions that the parties may have regarding a particular criminal activity. Furthermore, forensic scientists are not tasked with the fact-finding mission in the criminal justice system and are not in charge of the decision-making with respect to the propositions of the parties. Therefore, the role of the scientist is necessarily limited to reporting the amount of support of the forensic evidence for these propositions in the form of a Bayes factor. It is of primary importance to note that, in this setting, we are not only concerned with supporting the correct model, but also with supporting it with the appropriate magnitude. An appropriate magnitude of support is critical to ensure the coherent combination of the respective weights of the multiple pieces of evidence that may be considered in a given case. This imperative requirement is the main motivation behind the present work and is the main force that drives us away from more deterministic pattern recognition algorithms. Assuming a fingermark recovered in connection with a crime and a suspect, Mr. X., we consider the following two alternative propositions:
H1: the fingermark was left by Mr. X.
H2: the fingermark was not left by Mr. X., but by another person in a relevant population of potential donors.
To address the so-called prosecution and defence propositions, H1 and H2, we define two corresponding models: M1, representing how Mr. X. generates fingermarks, and M2, representing how fingermarks are generated by the donors in the population of alternative sources. Let e_u denote the set of observations made on the fingermark. We are interested in evaluating the following Bayes factor:

BF = f(e_u | M1) / f(e_u | M2) = [∫ f(e_u | θ1, M1) π(θ1 | M1) dθ1] / [∫ f(e_u | θ2, M2) π(θ2 | M2) dθ2],

where π(·) is a measure of belief about the model and its parameters (see Robert (2007), chapter 7, for a formal discussion on Bayesian model selection), and f(·) represents a (marginal) probability density function. Fingerprints are usually characterised through certain features of the ridge pattern, such as the general pattern formed by the friction ridge flow and events disturbing the continuity of the ridges. These events, traditionally called minutiae, occur when a ridge ends (ridge ending) or bifurcates (bifurcation). Other types of events exist but are mainly combinations of these two basic types.
[Figure: Details of the ridge pattern. Ridges appear darker than the background. Ridges can end (ridge ending) or split (bifurcation); ridges can create enclosures (upper left corner) or be very short (upper right corner); ridges can be so short that they appear as dots. Short, narrow and non-continuous ridges that appear between two parallel ridges are called incipient ridges. White dots within the ridges are sweat pores.]
When comparing two fingerprints, an examiner first verifies the compatibility of their general patterns and then determines whether the spatial relationships, types, and orientations of the minutiae on both impressions correspond. Mathematically characterising fingerprint patterns results in heterogeneous, high-dimensional random vectors: minutia locations and spatial relationships are continuous measurements; minutia types are discrete observations; and minutia orientations are circular measurements. In addition, the dimensionality of the problem increases with the number of minutiae observed on a given impression. Therefore, modelling the joint likelihood of the characteristics of multiple features observed on an impression seems to be an unreasonable challenge. While many attempts have been made to assign Bayes factors for fingerprint evidence (see Abraham et al. (2013) for a recent review), no algorithm has gained wide acceptance. In particular, while the results obtained with the algorithm proposed by Neumann, Evett and Skerrett (2012) were used to support the general admissibility of fingerprint evidence in U.S. courts (State v. Hull, 2008; State v. Dixon, 2011), commentators argued that the model had two main shortcomings. Some commentators discussed that the algorithm did not result in a formal Bayes factor as it does not formally incorporate user beliefs (Cowell, 2012; Graversen, 2012; Kadane, 2012; Lauritzen, 2012; Stern, 2012). Others noted that the algorithm relied on an ad-hoc weighting function, used to palliate the inability of the authors to assign Bayes factors at a tolerance of 0, and that this function had no other justification than convenience (Balding, 2012; Fotopoulos, 2012; Jandhyala, 2012; Kadane, 2012).
In this paper, we take advantage of the similarities between the algorithm proposed by Neumann, Evett and Skerrett (2012) and the Approximate Bayesian Computation framework to provide a method to formally and rigorously assign Bayes factors to fingerprint evidence. Our method leverages the properties of the well-known Receiver Operating Characteristic (ROC) curve to address shortcomings in existing ABC algorithms. Our application addresses the issues raised in relation to Neumann, Evett and Skerrett (2012) and provides a much-needed general framework for the quantification of the weight of any type of forensic pattern evidence, as long as a similarity measure can be defined to compare two pieces of evidence.

Approximate Bayesian Computation
Approximate Bayesian Computation (ABC) originated as a class of algorithms designed to sample from the approximate posterior density of a vector of parameters, θ, given an observed data set, D, without direct evaluation of the likelihood function, f(D|θ). This class of algorithms is especially useful in complex and high-dimensional settings where the likelihood function is not available in a usable form (see Robert et al. (2011) or Sisson, Fan and Beaumont (2019) for an overview).
To sample from an approximate posterior density, vectors of parameter values are first sampled from a prior distribution over the parameter space, and then used to generate pseudo-observations (each denoted D*) from an assumed generative model. A vector of parameter values is retained if the distance measured by a kernel function, ∆{·, ·}, between values of a summary statistic, η(·), of D and D* is less than some tolerance, t > 0. In other words, the i-th sampled parameter vector, θ^(i), is retained if the corresponding distance satisfies ∆{η(D), η(D*^(i))} < t.
In some application problems, such as forensic science, statisticians are not concerned with assigning posterior probability distributions in the parameter space, but are interested in performing model selection. However, performing likelihood-based inference using patterns of forensic interest, such as fingerprints, shoe sole impressions, or striations on bullets, is not feasible in the original feature space of the data. Starting with Pritchard et al. (1999), ABC has evolved into an algorithm that can be used for model selection by considering a model index parameter, M, and its prior distribution over model indices. The model index determines the prior distribution over the parameter space and the likelihood structure used to generate pseudo-observations. The i-th sampled model index parameter, M^(i), is retained if the corresponding distance satisfies ∆{η(D), η(D*^(i))} < t.
The original ABC method for assigning a posterior probability to a model, given the observed data, is a function of the ratio of the number of times that model index has been retained over the total number of times any model was retained (Pritchard et al., 1999; Beaumont, 2008; Grelaud et al., 2009; Toni and Stumpf, 2010; Didelot et al., 2011; Robert et al., 2011). Mathematically, the ABC posterior odds for the comparison between two models are given by

π_t(M = 1 | D) / π_t(M = 2 | D) = [Σ_{i=1}^{N} I(M^(i) = 1, ∆{η(D), η(D*^(i))} < t)] / [Σ_{i=1}^{N} I(M^(i) = 2, ∆{η(D), η(D*^(i))} < t)],

where I(·) is the indicator function and the subscript t in π_t indicates that this measure is a function of the choice of the tolerance level. The ABC Bayes factor, BF_abc, can then be assigned by dividing the ABC approximation of the posterior odds by the prior odds, π(M = 1)/π(M = 2). The ABC algorithm for comparing two models is summarised in Algorithm 1 (Robert et al., 2011).
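As a toy illustration of this rejection scheme, the sketch below compares two Gaussian models; the models, summary statistic, kernel and tolerance are our own illustrative choices, not the ones used for fingerprint evidence later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_model_selection(D, eta, delta, t, n_sims=20_000):
    """Rejection ABC for model selection between two toy models:
    model 1 generates N(0, 1) data, model 2 generates N(2, 1) data.
    With equal priors on the model index, the posterior odds equal BF_abc."""
    retained = []
    for _ in range(n_sims):
        m = int(rng.integers(1, 3))                 # sample model index from prior
        mu = 0.0 if m == 1 else 2.0
        D_star = rng.normal(mu, 1.0, size=D.size)   # pseudo-observations
        if delta(eta(D), eta(D_star)) < t:          # retain if within tolerance
            retained.append(m)
    retained = np.asarray(retained)
    n1 = int((retained == 1).sum())
    n2 = int((retained == 2).sum())
    return n1 / max(n2, 1)                          # BF_abc under equal priors

eta = np.mean                       # (non-sufficient) summary statistic
delta = lambda a, b: abs(a - b)     # kernel function
D = rng.normal(0.0, 1.0, size=50)   # observed data, truly from model 1
bf = abc_model_selection(D, eta, delta, t=0.1)
```

With the data truly generated under model 1, BF_abc should exceed 1 by a wide margin; shrinking t sharpens the approximation at the cost of fewer accepted samples, which is exactly the trade-off discussed next.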
The benefits of Approximate Bayesian Computation methods are not without cost. Model selection using ABC is subject to two primary sources of error:
1. the quality of the approximation to the likelihood function due to the use of the tolerance, t; and
2. the loss of information engendered by using a non-sufficient summary statistic.
The effect of the tolerance level, t, on the performances of ABC algorithms has been widely discussed since the inception of ABC methods. In short, if t is too large, too many samples from the prior distribution are accepted and the approximation becomes invalid; and, if t is too small, the acceptance rate becomes vanishingly small and the computational cost prohibitive.

Algorithm 1: ABC model selection algorithm
Data: Observed data, D.
for i = 1, ..., N do
    Sample a model index, M^(i), from the prior π(M);
    Sample a parameter vector, θ^(i), from the prior π(θ | M^(i));
    Generate pseudo-observations, D*^(i), from f(· | θ^(i), M^(i));
    Retain M^(i) if ∆{η(D), η(D*^(i))} < t;
end
The motivation for using summary statistics, rather than the full data set, stems from the curse of dimensionality encountered with the high-dimensional data sets that are common to scenarios in which ABC is necessary. Ideally, a summary statistic that is sufficient across models should be used to enable the convergence of BF_abc to the Bayes factor (Robert et al., 2011; Didelot et al., 2011). In general, it seems difficult, if not impossible, to find sufficient summary statistics for the types of data that are typically used in ABC methods. The goal is then to avoid the curse of dimensionality while minimising the loss of information incurred when using non-sufficient statistics. While it may seem that a general solution to the potential lack of sufficiency of summary statistics could consist of using a large set of summary statistics, with the hope that they will jointly tend towards sufficiency by decreasing the loss of information, this is not the case: as the dimensionality of the set of summary statistics increases, the algorithm again suffers from the curse of dimensionality (Blum and Francois, 2010).
Several modifications of the traditional ABC algorithm for model selection have been proposed with the goal of addressing some of the issues related to the determination of the tolerance and the selection of low-dimension summary statistics that minimise the loss of information. Most of these modified algorithms are rooted in the one proposed by Beaumont (2008), who suggests using a polychotomous weighted logistic regression model, trained on the summary statistics of the pseudo-observations and corresponding model indices, to predict the posterior probability of a model. Modifications include variable selection (Estoup et al., 2012), replacement of the logistic regression by artificial neural networks (Blum and Francois, 2010) or random forests (Pudlo et al., 2016), and the use of posterior probabilities as summary statistics (Prangle et al., 2014a).
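A minimal sketch of the regression idea behind these methods follows: a classifier is fitted on summaries simulated under each model and then evaluated at the observed summary. The one-dimensional Gaussian summaries, learning rate and iteration count are our own illustrative assumptions, and a plain two-class logistic regression fitted by gradient ascent stands in for the weighted polychotomous regression of Beaumont (2008).

```python
import numpy as np

rng = np.random.default_rng(1)

# Summary statistics simulated under each model (toy one-dimensional case).
eta1 = rng.normal(0.0, 0.5, size=2000)                # summaries under model 1
eta2 = rng.normal(1.0, 0.5, size=2000)                # summaries under model 2
x = np.concatenate([eta1, eta2])
y = np.concatenate([np.ones(2000), np.zeros(2000)])   # label 1 = model 1

# Fit logistic regression P(M = 1 | eta) by gradient ascent on the
# mean log-likelihood.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w += 0.5 * float(((y - p) * x).mean())
    b += 0.5 * float((y - p).mean())

def posterior_model1(eta_obs):
    """Estimated posterior probability of model 1 at the observed summary."""
    return 1.0 / (1.0 + np.exp(-(w * eta_obs + b)))

post = posterior_model1(0.0)   # observed summary near model 1's centre
```

An observed summary near model 1's centre should yield a high estimated posterior for model 1; note that the estimate depends on the proportions of training data from each model, which is precisely the objection raised in point 5 below.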
Finally, we highlight work proposing a set of necessary and sufficient conditions for summary statistics that result in a corresponding BF_abc that supports the true model asymptotically. These authors suggest that summary statistics should be chosen so that their asymptotic means differ under opposing models, so markedly that their ranges are non-overlapping. We note that showing that summary statistics satisfy these conditions in a real application is not trivial and, furthermore, that we are not only interested in supporting the correct model, but in supporting it with the correct magnitude. Nevertheless, we will show below that our method relies on a similar argument. The literature shows that the issues of sufficiency, variable selection and tolerance level are usually entangled. The solutions to these issues proposed to date, and summarised above, have their own sets of complications:
1. None of these solutions enables formal monitoring of the convergence of BF_abc to the Bayes factor as the value of the tolerance, t, is reduced, or as the number of samples increases. Some convergence results for the regression adjustments in the context of parameter inference can be found in Blum and Francois (2010), but they are not directly applicable to the model selection context.
2. Most of these solutions focus on selecting the correct model, but are not necessarily designed to support the selected model with the appropriate magnitude, which is a critical requirement in forensic science.
3. Most of these methods require generating summary statistics from pseudo-observations that are as close as possible to the summary statistics of the observed data. That is, these methods attempt to estimate the behaviour of the model selection algorithm as t → 0+ without ever observing t = 0 (or any value even reasonably close, due to the curse of dimensionality). This involves the use of potentially complicated variable selection methods, and the definition of an appropriate kernel function to compare summary statistics (Prangle, 2017).
4. Methods based on generalised linear models (Beaumont, 2008), artificial neural networks (Blum and Francois, 2010), random forests (Pudlo et al., 2016) or other classifiers can be very computationally intensive, depending on the dimension of the summary statistics or the number of pseudo-observations generated from the considered models. They also rely heavily on appropriately estimating the (very large number of) parameters of these classifiers.
5. Some of the model selection algorithms, such as the one proposed by Beaumont (2008), are trained using a limited subset of pseudo-observations. These pseudo-observations are selected or weighted based on the proximity of their summary statistics to those of the observed data. Unfortunately, this results in replacing user-defined prior probabilities on model indices with probabilities assigned in unpredictable ways by the algorithm, based on the proportions of training data selected from each model.
In this paper, we propose a modification of the ABC algorithm that utilises a relationship between the ABC Bayes factor and the Receiver Operating Characteristic (ROC) curve. Our novel ROC-based approach addresses the instability created by the use of a heuristic tolerance level. Our algorithm can accommodate sets of summary statistics of any size without being subject to the curse of dimensionality, and without concern for the proximity of the summary statistics of the observed and pseudo-data, as long as conditions similar to the ones discussed above are satisfied. Critically, our solution enables us to rigorously control its convergence as the number of simulations increases. Furthermore, of the two proposed versions of our algorithm, one requires estimation of only four parameters, and the other does not require estimation of any parameters.

ROC-ABC algorithm for model selection
ROC curves are traditionally used to measure the performance of a binary classifier as the decision threshold, t, is varied across the domain of (dis)similarity scores, ∆{·, ·}, that can be returned by the classifier (Pepe, 2003). ROC curves are obtained by plotting the rate of correct decisions in favour of the first model against the rate of incorrect decisions (false positives) in favour of the first model (i.e. when the second model is true) for all possible values of the decision threshold.
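An empirical ROC curve can be computed directly from the two sets of distance scores. The helper below is our own sketch, with illustrative gamma-distributed scores standing in for the dissimilarity scores produced by an ABC run.

```python
import numpy as np

def empirical_roc(scores_m1, scores_m2):
    """Empirical ROC from dissimilarity scores: for every observed threshold t,
    F_hat(t) = proportion of model-1 scores below t (correct decisions for model 1),
    G_hat(t) = proportion of model-2 scores below t (false positives for model 1).
    Returns (p, ROC(p)) = (G_hat, F_hat) over all observed thresholds."""
    thresholds = np.sort(np.concatenate([scores_m1, scores_m2]))
    F_hat = np.searchsorted(np.sort(scores_m1), thresholds, side="left") / scores_m1.size
    G_hat = np.searchsorted(np.sort(scores_m2), thresholds, side="left") / scores_m2.size
    return G_hat, F_hat

rng = np.random.default_rng(2)
s1 = rng.gamma(2.0, 1.0, size=5000)   # scores under model 1 (stochastically smaller)
s2 = rng.gamma(6.0, 1.0, size=5000)   # scores under model 2 (stochastically larger)
p, roc = empirical_roc(s1, s2)
auc = float(np.sum(np.diff(p) * (roc[1:] + roc[:-1]) / 2.0))   # trapezoidal AUC
```

When the score distributions under the two models are well separated, the curve rises steeply near p = 0, which is the region exploited in the derivation that follows.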
Defining F(·) and G(·) as the cumulative distribution functions of scores under models 1 and 2, respectively, the general form of the ROC is given by (Pepe, 2003) as ROC(p) = F(G^{-1}(p)), where p denotes the rate of false positives in favour of the first model and G^{-1}(·) denotes the quantile function (van der Vaart, 1998) for scores under the second model. Assuming equal priors for the model indices, we can show that the Bayes factor, BF_η, defined above, is a function of the ROC curve constructed on the set of ∆{η(D), ·} from model 1 and the set of ∆{η(D), ·} from model 2:

BF_abc = [K^{-1} Σ_{i=1}^{K} I(∆{η(D), η(D1*^(i))} < t)] / [L^{-1} Σ_{j=1}^{L} I(∆{η(D), η(D2*^(j))} < t)] = F̂(t) / Ĝ(t) ≈ F(t) / G(t) = ROC(p) / p, with p = G(t),

where K and L denote the numbers of pseudo-data sets generated under models 1 and 2, respectively, and F̂ and Ĝ denote the empirical distribution functions of the corresponding distance scores. The first equality comes from Robert et al. (2011); the second is a result of the definition of the empirical distribution function (Wasserman, 2013); and the approximate equality results from the convergence of the empirical distribution function to the true distribution function as K and L get arbitrarily large, and from the assumption that K and L grow at the same rate given that π(M = 2) = π(M = 1). This relationship shows that the ABC Bayes factor for two alternative models of interest can be assigned using the ratio between ROC(p) and p as the rate of false positives in favour of model 1 approaches 0:

BF_η = lim_{p→0+} ROC(p) / p.

This notable result allows us to express the convergence of the ABC Bayes factor as a function of the rate of false positives in favour of model 1. Our result has significant practical implications when it comes to using ABC for model selection:
1. Our solution allows us to monitor the convergence of the output of the algorithm to BF_η as a function of a single, well-defined measure, p, that only depends on the data generated under one of the considered models, as opposed to t, which depends on both models and is usually set arbitrarily.
2. Our solution is less sensitive to the curse of dimensionality as it does not require any of the ∆{η(D), ·} to be close to 0. Indeed, our solution only considers the relative ranks of the ∆{η(D), ·} calculated for the data generated under models 1 and 2.
3. Since our solution is less sensitive to the curse of dimensionality, it can accommodate vectors of summary statistics of any length. It only requires the design of a kernel function that ensures that the distributions of ∆{η(D), ·} are well-separated under the competing models. This point is similar to the separation conditions discussed above, except that the separation, in our solution, can be studied on the real line as opposed to a high-dimensional space.
4. Our solution uses the entire amount of pseudo-data generated, and does so in a computationally efficient manner.
5. Finally, our solution formally preserves the user's priors on the model indices.
Our solution requires estimation of the ROC curve, which is the topic of the next two sections.

Empirical ROC
We can leverage the relationship between the ABC Bayes factor and the ROC curve to assign an ABC Bayes factor in several ways. Our first method is purely data driven and uses the following approximation of the limit of ROC(p)/p:

BF_abc ≈ [K^{-1} Σ_{i=1}^{K} I(∆{η(D), η(D1*^(i))} < t)] / [L^{-1} Σ_{j=1}^{L} I(∆{η(D), η(D2*^(j))} < t)].

We see that, by defining t such that the number of accepted distance scores under model 2 is constant for any N, and by increasing the number of simulations, the ratio in the denominator will be driven to 0, hence approximating the limit as p → 0+. This approach has several major advantages as compared to current practice:
1. t is chosen only as a function of the distance scores between the value of the summary statistic of the observed data and the values of the summary statistic of the data generated under model 2 (versus all distance scores in other implementations of the ABC algorithm).
2. t is chosen such that the number of accepted distance scores under model 2 is fixed (versus a fixed value of t arbitrarily close to 0, or a varying value of t based on a fixed quantile of the empirical distribution of all the scores from models 1 and 2 combined).
For a given set of observed data, all current implementations of the ABC algorithm result in unpredictable variations of both the numerator and the denominator of the ABC posterior odds as N increases, which makes convergence difficult to monitor. By fixing the rate of convergence of the denominator, our approach makes it possible to better plan computing resources and to monitor convergence.
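The empirical procedure can be sketched as follows; the gamma score distributions and the choice of 100 accepted model-2 scores are illustrative assumptions, not values from the paper.

```python
import numpy as np

def roc_abc_bf(scores_m1, scores_m2, n_accept_m2=100):
    """Empirical ROC-ABC Bayes factor: fix the tolerance t so that exactly
    n_accept_m2 model-2 distance scores are accepted. The implied
    false-positive rate p = n_accept_m2 / L shrinks as more simulations
    are generated, and the Bayes factor is approximated by ROC(p) / p."""
    t = np.sort(scores_m2)[n_accept_m2 - 1]   # t depends on model-2 scores only
    roc_p = float((scores_m1 <= t).mean())    # ROC(p): acceptance rate under model 1
    p = n_accept_m2 / scores_m2.size          # fixed false-positive rate
    return roc_p / p

rng = np.random.default_rng(3)
s1 = rng.gamma(2.0, 1.0, size=50_000)   # distance scores under model 1
s2 = rng.gamma(6.0, 1.0, size=50_000)   # distance scores under model 2
bf = roc_abc_bf(s1, s2)
```

Doubling the number of simulations while keeping n_accept_m2 fixed halves p, so the convergence of the denominator is controlled by design rather than by an arbitrary tolerance.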

The non-central dual beta ROC model
Our second approach further extends the relationship between the ABC Bayes factor and the ROC curve by noting that (Pepe, 2003)

dROC(p)/dp = f(G^{-1}(p)) / g(G^{-1}(p)),

where f(·) and g(·) denote the probability density functions of distance scores under models 1 and 2, respectively. Assigning an ABC Bayes factor through this relationship requires evaluation of the first derivative of ROC(p), as p → 0+, or of the ratio of densities, f(·)/g(·), as p → 0+. Following Metz, Herman and Shen (1998), Mossman and Peng (2016), and Chen and Hu (2016), we choose to fit a semi-parametric model to the empirical ROC curve obtained from the finite number of ∆{η(D), ·} generated by the algorithm. As noted by these authors, fitting a model to each of the score distributions assumes that the scores follow this particular model. Instead, fitting a model directly to the ROC curve relies on the weaker assumption that there exists a monotonic transformation of the observed scores that results in scores whose distributions can be described by a simple model. Since the ROC curve is invariant to monotonic increasing transformations of the scores (Pepe, 2003; Metz, Herman and Shen, 1998; Mossman and Peng, 2016; Chen and Hu, 2016), the simple model can be used to represent the ROC curve without knowledge of the underlying score distributions. The common binormal representation of the ROC was considered (Pepe, 2003); however, it can easily be shown that the limit of the binormal model as p → 0+ only takes the values 0, 1 or ∞. Instead, we use a model based on two non-central beta distributions (Johnson, Kotz and Balakrishnan, 1995). Placing a restriction on the first shape parameter of each distribution (α_F = α_G = 1) results in the following semi-parametric model for the ROC curve:

ROC(p) = F(G^{-1}(p)),

where F(·) and G(·) are non-central beta distribution functions with parameters α_F, β_F, λ_F and α_G, β_G, λ_G corresponding to F and G, respectively.
Note that, contrary to current ABC methods for model selection based on pattern recognition algorithms, our method only requires the estimation of four parameters. The restriction on the first shape parameter of the densities guarantees that their ratio, as p → 0+, produces a stable result for all parameter values within the support:

lim_{p→0+} ROC(p)/p = f(0)/g(0) = [β_F exp(-λ_F/2)] / [β_G exp(-λ_G/2)].

Fitting the semi-parametric model requires the use of numerical optimisation techniques to estimate the parameter values. We use a two-step fitting procedure:
1. Obtain an initial set of parameter estimates for β_F, λ_F, β_G, λ_G using a maximum likelihood approach for fitting a dual beta ROC model to continuous data (Metz, Herman and Shen, 1998; Mossman and Peng, 2016; Chen and Hu, 2016).
2. Refine these estimates by minimising the L2 norm between the semi-parametric model and the empirical ROC.
The first step of our procedure enables us to include information about which model produced each distance score ∆{η(D), η(D*^(i))}. However, during our implementation, we found that the fit of the semi-parametric model to the empirical ROC curve obtained by this method alone could be improved. Meanwhile, the L2 norm objective function alone did not allow us to fit an adequate model when the score distributions overlapped heavily. Combining both processes gave the best results.
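The restricted dual-beta model and its stable limit can be sketched as follows. The non-central beta CDF is evaluated through its standard Poisson-mixture representation (a truncated sum of central beta CDFs), the quantile G^{-1}(p) is found by root-finding, and the parameter values used in the numerical check are illustrative.

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import poisson
from scipy.optimize import brentq

def ncbeta_cdf(x, beta_p, lam, n_terms=200):
    """CDF of a non-central beta(alpha = 1, beta_p, lam) distribution,
    via its Poisson mixture of central beta CDFs:
    F(x) = sum_j Poisson(j; lam/2) * I_x(1 + j, beta_p)."""
    j = np.arange(n_terms)
    w = poisson.pmf(j, lam / 2.0)
    return float(np.sum(w * betainc(1.0 + j, beta_p, x)))

def dual_beta_roc(p, bF, lF, bG, lG):
    """ROC(p) = F(G^{-1}(p)) under the alpha_F = alpha_G = 1 restriction."""
    q = brentq(lambda x: ncbeta_cdf(x, bG, lG) - p, 0.0, 1.0)  # q = G^{-1}(p)
    return ncbeta_cdf(q, bF, lF)

def bf_limit(bF, lF, bG, lG):
    """Limit of ROC(p)/p as p -> 0+: with alpha = 1, both densities are
    finite and positive at 0, where they equal beta * exp(-lambda/2)."""
    return (bF * np.exp(-lF / 2.0)) / (bG * np.exp(-lG / 2.0))

# Numerical check that the model ratio stabilises near p = 0
# (parameter values bF=5, lF=1, bG=2, lG=4 are illustrative).
ratio = dual_beta_roc(1e-6, 5.0, 1.0, 2.0, 4.0) / 1e-6
limit = bf_limit(5.0, 1.0, 2.0, 4.0)
```

Evaluating ROC(p)/p at a small bounded p, rather than taking the analytical limit, mirrors the robustness device described below.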
Once estimates for β_F, λ_F, β_G, and λ_G are obtained, it is trivial to use the limit above to assign the ROC-ABC Bayes factor. In practice, when a limited number of distance scores near 0 are observed, the quality of the fit of the ROC model near 0 can produce Bayes factors of meaningless magnitude (e.g., larger than 10^100 or smaller than 10^-100); these Bayes factors would vary wildly from one computation to another using the same observed data. We found that bounding the rate of false positives, p, to some low value, such as 10^-5, before evaluating the fitted ROC model produced more robust ABC Bayes factors.

Weight of fingerprint evidence using ROC-ABC

ABC for model selection possesses some obvious similarities with the algorithm proposed by Neumann, Evett and Skerrett (2012). To assign a Bayes factor to an observed fingermark, Neumann et al.'s 2012 algorithm considers two sets of scores. The first set contains scores measuring the level of dissimilarity between the observed fingermark and pseudo-fingermarks generated by Mr. X. The second set contains scores measuring the dissimilarity between the observed fingermark and pseudo-fingermarks originating from a sample of individuals from a population of potential alternative sources. The general idea behind Neumann, Evett and Skerrett (2012)'s model is that, if H1 is true, Mr. X. would generate many more pseudo-fingermarks that are similar to the observed fingermark than the individuals in the population of alternative sources would. If we define:
1. D using the observed fingermark, e_u;
2. models 1 and 2 as the prosecution model and the defence model, respectively;
3. generative models corresponding to M1 and M2;
4. a kernel function, ∆{·, ·}, to compare pairs of finger impressions;
we obtain an ROC-ABC algorithm to approximate a Bayes factor for fingerprint evidence. The ROC-ABC algorithm for fingerprint evidence is summarised in Algorithm 2.
The algorithm will converge to the Bayes factor under the same conditions as discussed above (i.e., sufficient statistic across all models, optimal distance metric, infinite/large number of simulations, etc.).

Algorithm 2: ROC-ABC algorithm for fingerprint evidence
Data: Observed fingermark, e_u.
for i = 1, ..., N do
    Sample a model index, M^(i), from the prior π(M);
    Select a control print according to M^(i) and sample distortion parameters;
    Generate a pseudo-fingermark, e_u*^(i), by distorting the selected print using the sampled distortion parameters;
    Compute ∆{η(e_u), η(e_u*^(i))};
end
Assign the ROC-ABC Bayes factor using the empirical ROC or the non-central dual beta ROC model.

Generation of pseudo-fingermark data
Implementation of the ROC-ABC algorithm for fingerprint evidence requires a model from which pseudo-fingermarks can be generated. We utilise the same fingerprint distortion model as Neumann, Evett and Skerrett (2012) to generate pseudo-fingermarks from any given control print. This model mimics the way fingerprint features are displaced as the skin on the tip of a finger is distorted when pressed against a flat surface. The parameter space of the model represents a wide variety of distortion directions and pressures, and allows many different distortions to be produced from a single finger. Our distortion model assumes that all fingers distort in the same way, independently of factors related to the donor (e.g., age, weight, profession) and to the finger number (e.g., right vs. left hand, thumb vs. index finger). The model obviously does not cover all possible distortion and pressure conditions, or all donor-related factors; however, it is currently the only option for obtaining a large number of pseudo-fingermarks from any given individual.
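The distortion model of Neumann, Evett and Skerrett (2012) is not reproduced here. The following toy stand-in (a small random rotation and shear plus jitter, with all parameter ranges our own) only illustrates the mechanism: sampled distortion parameters turn the minutiae of a control print into a pseudo-fingermark.

```python
import numpy as np

rng = np.random.default_rng(4)

def distort(minutiae, max_rot=0.05, max_shear=0.05, noise_sd=1.0):
    """Toy stand-in for a skin-distortion model (NOT the model of
    Neumann, Evett and Skerrett, 2012): apply a small random rotation
    and shear about the centroid, mimicking directional pressure, plus
    localized jitter, to a (k x 2) array of minutia coordinates."""
    theta = rng.uniform(-max_rot, max_rot)          # sampled rotation (radians)
    s = rng.uniform(-max_shear, max_shear)          # sampled shear coefficient
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.array([[1.0, s], [0.0, 1.0]])
    centred = minutiae - minutiae.mean(axis=0)      # distort about the centroid
    return (centred @ (R @ S).T + minutiae.mean(axis=0)
            + rng.normal(0.0, noise_sd, size=minutiae.shape))

print_minutiae = rng.uniform(0, 500, size=(7, 2))   # 7 minutiae on a control print
pseudo_mark = distort(print_minutiae)               # one sampled pseudo-fingermark
```

Repeated calls with freshly sampled parameters yield the many distinct pseudo-fingermarks per finger that the algorithm requires.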

Generation of pseudo-fingermarks under the prosecution model
When comparing fingerprints, an examiner first detects k features of interest on the fingermark. Second, the examiner compares the fingermark to the 10 control prints from the donor considered by the prosecution proposition and selects the finger appearing to be the most likely source of the mark. Finally, the examiner attempts to identify the most similar k corresponding features out of the n features present on the control print from the selected finger. Note that, once selected, the sets of features on the fingermark and the control print remain fixed for the duration of the experiment, and that the selection process results in a unique bijective pairing between the two sets of k features. Our algorithm assumes that this selection and pairing process has been done before generating pseudo-fingermarks under M1. When M1 is selected by the algorithm, a pseudo-fingermark is generated from the k features selected on the control print using the distortion model. By construction, the features on this pseudo-fingermark have the same pairing with the features of the fingermark as the ones on the control print. The uncertainty on the type of each feature is modelled as described in Neumann et al. (2015). By repeating this process each time M1 is selected, we obtain a set of pseudo-fingermarks from the k features selected on the control print of Mr. X.

Generation of pseudo-fingermarks under the defence model
The defence proposition considers that, if the fingermark was not left by Mr. X, it must have been left by another person in a relevant population of alternative sources. Assuming that fingerprint patterns result from a completely random process during the development of the foetus, we can generate data under M2, first, by randomly selecting an individual from any representative sample of donors from the human population, and second, by generating a pseudo-fingermark from this individual's configuration of k minutiae that is most similar to the k minutiae observed on the fingermark. As in the previous section, the features of this pseudo-fingermark have a unique bijective pairing with the features on the fingermark by construction. This process is repeated each time M2 is selected to obtain a random set of pseudo-fingermarks from the population of potential donors considered by M2. Since it would be unrealistic to manually repeat the selection of the most similar k minutiae for each individual in a large sample, we use a commercially available automated fingerprint matching system to perform this task.

Summary statistic
Configurations of minutiae can be described numerically in the form of heterogeneous multi-dimensional random vectors containing the measurements summarised in Table 1.
Minutiae locations and orientations are taken with respect to a coordinate system based on the frame of the impression's picture (Figure 3 (a)). Different framings of the same impression result in different measurements for the same set of features. For this reason, the original measurements need to be summarised in a way that is rotation and translation invariant. Several invariant measurements capturing the spatial relationships between the minutiae in a configuration can be calculated, such as the distances between every pair of minutiae in the configuration (Figure 3 (b)) and the distance of each minutia from the centroid of the configuration (the average of the Cartesian coordinates of the minutiae in the configuration) (Figure 3 (c)). A similar approach can be used to define invariant summaries of the direction of each feature by using fixed-length segments to represent minutiae directions and by taking the distances between the ends of these segments for every pair of minutiae in the configuration (Figure 3 (d)). We also choose to represent minutiae directions as a function of the axes going from the centroid through the location of each minutia (angles are measured counterclockwise) (Figure 3 (e)). Feature types can be compared directly between configurations without the need for summarising, since types are rotation and translation invariant. All of these measurements can be brought together to create a vector of summary statistics of the original representation. For the example in Figure 3 with 7 features, the vector of summary statistics would include 21 cross-distances between pairs of minutiae, 7 distances between minutiae and the configuration's centroid, 21 cross-distances capturing the spatial representation of the minutiae directions, 7 angles, and 7 types, for a total length of 63. For 10 features, the length of the vector of summary statistics would be 120; for 15 features, it would be 255.
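The construction of this invariant summary vector can be sketched as follows (a minimal Python sketch; the segment length used to represent minutia directions is our assumption, as the text does not specify a value):

```python
import numpy as np
from itertools import combinations

def summary_statistic(xy, theta, types, seg_len=20.0):
    """Rotation- and translation-invariant summary of a minutiae configuration.

    xy      : (k, 2) array of Cartesian minutia locations (pixels)
    theta   : (k,) array of minutia directions (radians, counterclockwise)
    types   : (k,) array of numerically coded minutia types
    seg_len : length of the fixed segments representing directions
              (an assumed value; the paper does not specify it)
    """
    k = len(xy)
    pairs = list(combinations(range(k), 2))

    # (b) cross-distances between every pair of minutiae
    cross = np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in pairs])

    # (c) distance of each minutia from the configuration centroid
    centroid = xy.mean(axis=0)
    to_centroid = np.linalg.norm(xy - centroid, axis=1)

    # (d) cross-distances between the ends of fixed-length segments
    #     representing the minutiae directions
    ends = xy + seg_len * np.column_stack([np.cos(theta), np.sin(theta)])
    dir_cross = np.array([np.linalg.norm(ends[i] - ends[j]) for i, j in pairs])

    # (e) direction of each minutia relative to the centroid->minutia axis
    axis = np.arctan2(xy[:, 1] - centroid[1], xy[:, 0] - centroid[0])
    rel_angles = np.mod(theta - axis, 2 * np.pi)

    # types are invariant and enter the vector directly
    return np.concatenate([cross, to_centroid, dir_cross, rel_angles, types])
```

For k = 7 the returned vector has the stated length 21 + 7 + 21 + 7 + 7 = 63, and its entries are unchanged when the configuration is rotated and translated.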
Given the heterogeneity and dimension of the measurements, it is unlikely that a sufficient summary statistic exists for fingerprint data. Here we adopt the approach of pooling together as many individual summary statistics as possible in order to minimise the loss of information with respect to the original data. The curse of dimensionality is not a problem for our algorithm, as discussed in Section 3. Furthermore, given that the model used to generate pseudo-fingermarks under M2 is an extension of the model used under M1, we can assume that, as our pooled summary statistic tends to sufficiency, it will be sufficient to compare M1 and M2 (Didelot et al., 2011). Nevertheless, the vectors of summary statistics described above were designed to illustrate the concept of the ROC-ABC in the context of fingerprints and can certainly be improved upon through further investigation.

Kernel function
Our ABC algorithm depends on a kernel function, ∆{·, ·}, which compares pairs of summarised configurations of minutiae. As mentioned in Section 3 and according to Marin et al. (2012), we wish to use a kernel function that best separates the distributions of ∆{η(e_u), ·} obtained under the competing models considered in Section 1.
We developed a metric to compare pairs of summarised configurations of minutiae, and optimised the weights of several components to best separate the two distributions of ∆{η(e_u), ·}. For completeness, we have included this development and optimisation process in Appendix A; however, we stress that other summary statistics, kernel functions and optimisation procedures could be considered without loss of generality of our proposed ROC-ABC method.
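The paper's actual metric and fitted weights are given in its Appendix A; purely as an illustration of the idea, a weighted kernel over the five blocks of the summary vector described in the previous section might be sketched as follows (the block-wise mean absolute difference and the default weights are our assumptions, and a circular difference would be more appropriate for the angle block):

```python
import numpy as np

def kernel_delta(eta_u, eta_v, k, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Illustrative weighted kernel between two summary vectors.

    Each vector of length k(k-1) + 3k is split into its five blocks
    (cross-distances, centroid distances, direction cross-distances,
    relative angles, types); the score is a weighted sum of per-block
    mean absolute differences, with weights playing the role of the
    coefficients c1..c5 optimised in the paper's Appendix A.
    """
    n_pairs = k * (k - 1) // 2
    cuts = np.cumsum([n_pairs, k, n_pairs, k])  # block boundaries
    u = np.split(np.asarray(eta_u, dtype=float), cuts)
    v = np.split(np.asarray(eta_v, dtype=float), cuts)
    return sum(c * np.mean(np.abs(bu - bv))
               for c, bu, bv in zip(weights, u, v))
```

A smaller score means the two summarised configurations are more similar; the score is zero when the vectors coincide.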

Number of simulations
For the purpose of this application, we limited the total number of pseudo-fingermarks generated for each test configuration to 500,000 (approximately 250,000 under each model). When assigning the ROC-ABC Bayes factor for a specific fingermark in actual forensic casework, a much larger number of pseudo-fingermarks can be generated.

Datasets
The algorithm has been developed, optimised, and tested using several datasets. Each dataset comprises fingermarks or control prints captured digitally at 1:1 magnification and a resolution of 500 pixels per inch. All features were either labelled manually by an experienced fingerprint examiner or determined automatically by the encoding algorithm of the automated fingerprint matching system used in this study. The following features were extracted from every fingerprint used in the study:
1. the finger of origin of the impression, from 1 (right thumb) to 10 (left little finger) (control prints only);
2. the Cartesian coordinates of each minutia in pixels, using the bottom left of the image as the origin;
3. the direction of each minutia in radians, measured counterclockwise;
4. the type of each minutia: ridge ending, bifurcation, or unknown;
5. the Cartesian coordinates of the centre of the impression (control prints only).
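For illustration, these per-impression features could be held in a record structure along the following lines (a sketch; the field names and types are ours, not taken from the datasets):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Minutia:
    x: float           # pixels, origin at the bottom left of the image
    y: float
    direction: float   # radians, measured counterclockwise
    mtype: str         # "ridge_ending", "bifurcation", or "unknown"

@dataclass
class Impression:
    minutiae: Tuple[Minutia, ...]
    finger: Optional[int] = None   # 1 (right thumb) .. 10 (left little);
                                   # control prints only
    centre: Optional[Tuple[float, float]] = None  # control prints only
```

Fingermarks would leave `finger` and `centre` unset, since those two features are recorded for control prints only.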
The four datasets (A, B, C, D) used in this study are described in Appendix B.

Results
Three experiments were performed using 4067 trace configurations ranging from 3 to 25 minutiae and sampled from 207 fingermarks. For each trace configuration, we consider in turn that:

TS: Mr. X (the donor of the control configuration) is the true source of the trace configuration (dataset B);
CNM: Mr. X is selected based on the similarity of his fingerprints to the trace configuration (dataset C);
RS: Mr. X is randomly selected in a population of donors (dataset D).
In these experiments, the control configurations in datasets B to D were used to resample pseudo-fingermarks using M1, and dataset A was used to resample pseudo-fingermarks using M2. Results for all three experiments can be found in Figures 4 and 5. ROC-ABC Bayes factors presented in Figure 4 were assigned using the empirical ROC method discussed in Section 3.1. Figure 5 presents ROC-ABC Bayes factors that were assigned using the non-central dual beta ROC model described in Section 3.2. Note that the vertical axis in Figure 5 has been truncated to focus on the mass of the distributions and that some extreme outliers may not be represented.

In experiment TS, the control prints originate from the true sources of the marks, and so H1 is true. Both figures show a similar behaviour, where the magnitude of the ROC-ABC Bayes factor increases as the number of minutiae increases. In both cases, the ROC-ABC Bayes factors appear bounded. The bound for the empirical ROC-ABC Bayes factors stems from Equation (10), which requires us to set a value for p. In this case, we define p as the false positive rate evaluated at the 10th smallest ∆{η(e_u), ·} generated under H2, which corresponds to approximately 1 in 25,000. The bound for the non-central dual beta ROC model stems from using the same p as in the empirical model. Bayes factors erroneously supporting H2 are noted in both series of results for smaller configurations of minutiae (3 to 7 minutiae), which is not surprising, as these configurations contain less discriminative information and are more likely to be observed in fingers from different individuals. Nevertheless, we note that only a handful of ROC-ABC Bayes factors support the wrong proposition for larger configurations. These cases deserve further investigation; at this time, we believe that they are related to configurations displaying unusual distortion that cannot be handled by the current generation of the distortion algorithm.
Improvement of the summary statistic, kernel function, and distortion algorithm may further minimise the number of cases where the Bayes factor erroneously supports H2. Finally, we observe that the ranges of values taken by the ROC-ABC Bayes factor for different configuration sizes overlap, which supports the observations made by Neumann, Evett and Skerrett (2012) that there is no scientific justification for the use of a fixed number of minutiae as a decision point to distinguish between H1 and H2, and that the evidential value of each configuration needs to be quantified based on its own characteristics.
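A minimal sketch of how the empirical bound described above arises (assuming equal prior odds; this estimator reflects our reading of the empirical ROC method of Section 3.1 and is not the paper's exact Equation (10)):

```python
import numpy as np

def empirical_roc_abc_bf(scores_m1, scores_m2, m=10):
    """Empirical ROC-ABC Bayes factor sketch (equal prior odds assumed).

    scores_m1, scores_m2: kernel scores for pseudo-fingermarks generated
    under M1 (same source) and M2 (different sources); smaller scores
    mean greater similarity to the observed mark.  The tolerance is set
    at the m-th smallest score under M2, which fixes the false-positive
    rate at p = m / len(scores_m2); the Bayes factor is estimated as the
    true-positive rate at that tolerance divided by p, and is therefore
    bounded above by 1/p.
    """
    s2 = np.sort(np.asarray(scores_m2))
    t = s2[m - 1]                                # tolerance at fixed m
    p = m / len(s2)                              # false-positive rate
    tpr = np.mean(np.asarray(scores_m1) <= t)    # true-positive rate
    return tpr / p
```

With roughly 250,000 scores under M2 and m = 10, p is about 1 in 25,000, so the estimated Bayes factor can never exceed 25,000, matching the bound observed in experiment TS.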
In experiment CNM, the control prints originate from non-mated sources that are chosen due to their similarity to the traces, and so H2 is true. In both Figures 4 and 5, we observe that a large proportion of the ROC-ABC Bayes factors erroneously support the hypothesis of common source, H1, although the algorithm using the empirical ROC appears to perform significantly better. The high rate of misleading evidence is not a surprise for low numbers of minutiae, since it is not difficult to find multiple similar configurations on different fingers in a large dataset. The high rate of misleading evidence is more surprising for larger configurations. Larger values of the ROC-ABC Bayes factor when H2 is true occur when the kernel function used by the algorithm considers the pseudo-fingermarks generated using M1 more similar to the observed fingermark than they really are. As mentioned before, improvement of the summary statistic and the kernel function should significantly reduce the rate of misleading evidence in favour of H1. In practice, we believe that examiners comparing close non-matching finger impressions would be able to exclude that they originate from a common source by visual inspection, using friction ridge characteristics that are not taken into account by our metric.
In experiment RS, the control prints are obtained from randomly selected sources, and so H2 is also true. In both cases, the majority of observations correctly support the hypothesis of different sources, H2. As in the second experiment, the algorithm using the empirical ROC curve significantly outperforms the one based on the non-central dual beta model of the ROC curve.
Overall, we note that the semi-parametric modelling of the ROC curve needs to be improved. In many situations, it appears that the dual-beta model of the ROC curve is not optimal and that mixtures of beta distributions would better represent the data.
These experiments were repeated using the logistic regression method of Beaumont (2008) (see Appendix C). Results can be found in Figure 8 of Appendix C. It is important to note that the logistic method does not present an obvious upper bound for the results in experiment TS and assigns Bayes factors with notably large magnitudes. This property is not desirable, since it may lead to unrealistic magnitudes of support.
A comparison of the computational times for the empirical and parametric ROC methods and the logistic method is presented in Figure 9 of Appendix C. The empirical ROC-ABC method was without rival in terms of computation time. The logistic method performed at a much slower rate, and the difference between the two methods increased with the dimensionality of the data. Compared to a widely used ABC algorithm, our method provides the high computational efficiency that is necessary for real-time calculation of the weight of forensic evidence in casework.

Discussion
The contribution of this paper is twofold. Firstly, we have proposed a method to rigorously quantify the weight of fingerprint evidence using the formal statistical framework provided by ABC. Secondly, our ROC-ABC algorithm helps with some of the issues commonly associated with ABC algorithms.
Overall, our results are consistent with the results presented by Neumann, Evett and Skerrett (2012), while capturing the user's belief about the parameters of the two competing models and providing an alternative to the weighting function that they proposed:
1. The probability of misleading evidence in favour of the defence decreases dramatically as the number of minutiae increases.
2. The probability of misleading evidence in favour of the prosecutor is very low for configurations greater than 7 minutiae when the donor has been randomly selected. As expected, this probability is higher when the donor has been selected based on the similarity of his fingerprints with the fingermark. Improvements in the summary statistic used to describe fingerprint patterns and in the kernel function used to compare them, together with the ability of fingerprint examiners to account for more discriminative information than our model, will certainly reduce the rate of misleading evidence in favour of the prosecutor in an operational implementation of the model.
3. Minutiae alone may not be sufficient to establish the identity of the source, and the contribution of additional information regarding the fingerprint pattern needs to be taken into account when determining the donor of a fingermark.
In addition to the straightforward operational implementation of our ROC-ABC approach, Figures 6 and 7 illustrate the powerful visual representation of the probative value of fingerprint comparisons that is offered by the ROC-ABC algorithm. In Figure 6, the latent print originates from the same source as the control print, and BF_abc = 2.51 × 10^7. The score distributions and the ROC curve in the middle and right panels intuitively show that the data support the hypothesis that the trace and control prints originate from the same source. In contrast, Figure 7 shows a situation where the same latent print as in Figure 6 is compared to a randomly selected control print. In this scenario, BF_abc = 3.56 × 10^-153. The clear support of the algorithm for the hypothesis that the latent and control impressions do not originate from the same source can be seen in the score distributions and ROC curve in the middle and right panels. We believe that this intuition can easily be conveyed to jurors and other fact-finders.
We do not claim that the kernel function that was used in this paper is optimal. It is worth exploring adaptive kernels that maximise the separation between the pseudo-data generated by both models in any specific case. Nevertheless, while the summary statistic and the metric used to generate the results presented in this paper can be improved upon, they are adequate to show the potential of the concept of the ROC-ABC algorithm. Operational implementation of the method would require further studies of the repeatability of the values obtained by the algorithm as a function of different samples (and different sample sizes) of the population of potential donors considered by H 2 .
Our proposed modification to the standard ABC model selection algorithm results in several improvements. Our approach, based on properties of the ROC curve, transforms the convergence of the algorithm into a function of the rate of false positives in favour of the model considered in the numerator of the Bayes factor. Most current implementations of ABC for model selection result in unpredictable variations of both the numerator and the denominator of the ABC Bayes factor as the number of simulations, N, increases, which makes the convergence of the algorithm more difficult to monitor. In our approach, the tolerance level, t, for a given data set is chosen such that the number of accepted samples under the model considered in the denominator of the Bayes factor is fixed for all N. Hence, as N increases, the approximation of the limit as the rate of false positives goes to 0 improves. Our approach has the potential to better plan computing resources. Critically, our method allows for rigorously monitoring convergence. The shift from a tolerance level on ∆{η(D), ·} to a rate of false positives in favour of model 1 does not require any of the ∆{η(D), ·} to be close to 0. Instead, only the relative ranks of the ∆{η(D), ·} calculated for the data generated under models 1 and 2 are considered. This implies that the kernel function used to assess the level of similarity can accommodate vectors of summary statistics of any length without encountering the curse of dimensionality, because there is no need for any of the scores to be close to 0. Our algorithm only requires that the distributions of ∆{η(D), ·} are well separated under the competing models. A similar point has been made previously in the ABC literature, except that the separation, in our solution, can be studied on the real line as opposed to a high-dimensional space.
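The monitoring scheme described above can be sketched as follows (a hypothetical helper under our reading of the algorithm; the checkpoints are illustrative):

```python
import numpy as np

def bf_trace(scores_m1, scores_m2, m=10,
             checkpoints=(10_000, 50_000, 250_000)):
    """Empirical ROC-ABC Bayes factor at increasing simulation counts.

    Keeping the number of accepted samples under the denominator model
    (M2) fixed at m means the implied false-positive rate p = m/N
    shrinks as N grows, so successive checkpoints approximate the limit
    p -> 0+ and convergence can be monitored directly.  Only the ranks
    of the kernel scores matter, never their magnitudes, which is why
    the dimension of the summary statistic plays no role here.
    """
    trace = []
    for n in checkpoints:
        t = np.sort(scores_m2[:n])[m - 1]    # tolerance at fixed m
        tpr = np.mean(scores_m1[:n] <= t)    # true-positive rate at t
        trace.append((n, tpr / (m / n)))     # BF estimate at p = m/n
    return trace
```

For well-separated score distributions, the successive estimates grow toward the bound 1/p as N increases, making stagnation or divergence easy to spot.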
In addition to avoiding the curse of dimensionality, our method is also able to process the entire amount of pseudo-data generated in a computationally efficient manner and does not require filtering the pseudo-data: as the dimension of the summary statistic vector increases, the time required to assign Bayes factors using other methods (such as the logistic regression approach) increases exponentially (Figure 9). Finally, our solution allows us to formally preserve the user's priors on the model indices.
We have not formally addressed the issue of the sufficiency of the summary statistic that is required for the convergence of the ABC Bayes factor to the Bayes factor. This convergence is extremely important in some contexts, such as forensic science, where fact-finders are equally as interested in the proposition supported by the Bayes factor as in the magnitude of this support. Since the method that we propose is not affected by the curse of dimensionality, it permits including as much information in the summary statistic vectors as needed, as in Pudlo et al. (2016).
To implement our approach in practice, we have proposed two methods to assign ROC-ABC Bayes factors. Our results show significant differences in performance between the empirical ROC and the dual beta ROC methods. The empirical model appears to produce more stable and meaningful results (i.e., ABC Bayes factors with reasonable magnitudes). As we increase the number of simulations, the empirical model naturally approximates the limit p → 0+ in Equation (9). The non-central dual beta ROC model has the potential to explore the limit p → 0+ with a smaller sample size. In practice, we observed that the values obtained using Equation (15) for multiple runs of the ROC-ABC algorithm on a given set of observed data differ greatly from one another (by many orders of magnitude on the log10 scale). It appears that Equation (15) is very sensitive to small changes in the estimates of the model's parameters. Instead, we fixed p = 1/25,000 in Equation (14) to obtain more robust values and generate the data in Figure 5. Our results show that in several cases our algorithm does not support the correct model; this may be due to our modelling of the ROC curve in the neighbourhood of 0 not being an accurate representation of the data. Once again, this shows that the non-central dual beta ROC model is very sensitive to small changes in the estimates of its parameters. Improvements can be made to the fitting procedure for the non-central dual beta ROC model, such as an explicit monotonic increasing transformation of the distance scores to initiate the numerical optimisation procedure with distributions that are closer to the assumed model; alternatively, other models whose limits at 0 exist can be investigated.

Conclusion
In this paper, we propose an algorithm to formally and rigorously assign Bayes factors to forensic fingerprint evidence. Our modified ABC model selection algorithm was used to address several criticisms of the model proposed by Neumann, Evett and Skerrett (2012) by framing the problem in a formal Bayesian framework. The results presented here show that our method is promising, with low rates of misleading evidence, and has the potential to be applied to many other complex, high-dimensional evidence forms such as shoe prints, questioned documents, firearms, and traces characterised by analytical chemistry. Ultimately, the widespread use of statistical approaches to quantify the weight of forensic evidence in place of the existing inference paradigm can only be enabled by technology providers offering commercial products to the forensic community. Our method leverages currently available technology that was designed to search large databases for forensic traces and retrieve the most likely candidates. For mainstream evidence types such as fingerprints, firearms, and shoe impressions, our algorithm can readily be implemented, validated, and integrated into current commercial offerings. Furthermore, we note that the use of ROC curves in the algorithm will be naturally familiar to the engineers and scientists designing these systems, which may facilitate the implementation of our method in commercial systems.
As an added benefit, our algorithm addresses several shortcomings of current ABC model selection methods. We use the properties of the Receiver Operating Characteristic curve to address the issue of choosing a suitable tolerance level when assigning ABC Bayes factors. Our modification allows for a natural convergence of the algorithm as the number of simulations increases, and for monitoring this convergence as a function of the sole rate of false positives in favour of the model considered in the numerator of the Bayes factor. Focusing on the rate of false positives (rather than the tolerance level) allows our method to rely on the ordering of kernel scores, rather than their magnitudes, and thus it is not subject to the curse of dimensionality. In addition, our method considers the entire amount of pseudo-data generated under the considered models in a computationally efficient manner. Since our algorithm can accommodate vectors of summary statistics of any length, we are not interested in performing any form of variable selection. However, we are interested in obtaining the "best" separation between the score distributions under the competing models in order to recover the correct model; this extends an argument made previously in the ABC literature. Values for all ci in Equation (16) can be obtained by maximising the separation between the distribution of ∆{·, ·} values from minutiae configurations generated by the same donor and the distribution of ∆{·, ·} values from minutiae configurations generated by different donors. To obtain the results presented in this paper, we used numerical optimisation to maximise the average area under 450 ROC curves obtained from configurations with 5, 8, 12, 17, and 23 minutiae.
Each ROC curve was based on 50,000 distance scores, calculated as in Equation (16) and obtained by comparing k minutiae on a fingermark to pseudo-fingermarks resulting from the distortion of the true source of the fingermark (Section 4.1.1) and from the distortion of other fingers (Section 4.1.2). Our results indicated that the second and fifth components of our metric did not help discriminate between same-source and different-source distance scores, and they were assigned c2 = c5 = 0. Our results also indicated that component 3 was the most useful for maximising the average area under the ROC curve (c3 = 6.5), followed by component 1 (c1 = 1) and component 4 (c4 = 0.1).
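The optimisation step can be illustrated on synthetic scores (a sketch: we substitute a crude random search for the paper's numerical optimiser, the per-component distances are simulated, and the AUC is computed via the Mann-Whitney statistic rather than from an explicit ROC curve):

```python
import numpy as np

def auc(same, diff):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a same-source distance score is smaller
    (i.e., more similar) than a different-source one."""
    d = np.asarray(diff)[None, :] - np.asarray(same)[:, None]
    return np.mean(d > 0) + 0.5 * np.mean(d == 0)

def mean_auc(w, same_comps, diff_comps):
    """Average AUC across configurations; *_comps[i] is an
    (n_scores, 5) array of per-component distances, combined into a
    single score with the candidate weights w (the role of c1..c5)."""
    w = np.asarray(w)
    return float(np.mean([auc(s @ w, d @ w)
                          for s, d in zip(same_comps, diff_comps)]))

def optimise_weights(same_comps, diff_comps, n_trials=200, seed=0):
    """Crude random search standing in for the paper's numerical
    optimisation of the component weights."""
    rng = np.random.default_rng(seed)
    best_w = np.ones(5)
    best_a = mean_auc(best_w, same_comps, diff_comps)
    for _ in range(n_trials):
        w = rng.uniform(0.0, 10.0, size=5)
        a = mean_auc(w, same_comps, diff_comps)
        if a > best_a:
            best_w, best_a = w, a
    return best_w, best_a
```

By construction the search never returns weights worse than the equal-weight baseline, mirroring how uninformative components (such as those assigned c2 = c5 = 0 above) are down-weighted when they do not improve the average AUC.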
The logistic method does not present an obvious upper bound for the results in experiment TS and assigns Bayes factors with notably large magnitudes. Based on the convergence results shown in Equations (4) to (9), we do not believe that the larger magnitude of the Bayes factors in Figure 8 can be justified by the number of simulations performed in this experiment. We can only conclude that these Bayes factors severely overstate the weight of the evidence observed and generated in these cases. In addition, we note the very large variance of the Bayes factors assigned using the logistic regression method during experiment TS. This large variance results in a high rate of misleading evidence in favour of H2, noticeably greater than that of the empirical method.
Interestingly, results from experiments CNM and RS show that the logistic regression method produces less misleading evidence in favour of H1 when H2 is true, even when very similar prints are used. The logistic regression method maximises the separation between the two models by leveraging all of the content of the vectors of summary statistics, while the kernel function described in Section 4.3 has been optimised for the average case. The ROC-ABC method may be improved in this respect by using an adaptive kernel function that would also maximise the separation in each case.
Overall, while using the logistic regression method as proposed by Beaumont (2008), we noted that the weighting of the pseudo-data prior to fitting the logistic regression model resulted in the removal of large portions, if not all, of the data generated under either M1 or M2. As discussed previously, this alters the user-defined priors on the model index and replaces them with unpredictable data-driven priors. Furthermore, this led to instability when fitting the logistic regression model.
A comparison of the computation times among the three methods (empirical ROC, parametric ROC, and logistic regression) is presented in Figure 9. This computation time represents the time required to assign ABC Bayes factors using the three different methods once the pseudo-data has been generated. The empirical ROC-ABC method was without rival in terms of computation time.
Even as the complexity and dimensionality of the data increased, its computation time remained relatively constant. The logistic method outperformed the parametric ROC-ABC method up to 9 minutiae; beyond this point, the total computation time for the logistic method continued to increase at an exponential rate while the computation time for the parametric ROC-ABC method remained fairly uniform. This is unsurprising, since the computational complexity of the parametric ROC-ABC is driven by the number of univariate scores in the ROC curve, and not by the dimension of the vectors of summary statistics. In addition to an increase in computing time, an increase in computing resources was also required by the logistic regression method, such that fingerprint data with more than 22 features could not be processed on a standard desktop computer, while the ROC-based methods have a very small computing footprint (once the initial pseudo-data has been generated).