Bayesian Quickest Detection of Credit Card Fraud

This paper addresses the risk of fraud in credit card transactions by developing a probabilistic model for the quickest detection of illegitimate purchases. Using optimal stopping theory, the goal is to determine the moment, known as disorder or fraud time, at which the continuously monitored process of a consumer’s transactions exhibits a disorder due to fraud, in order to return the best trade-off between two sources of cost: on the one hand, the disorder time should be detected as soon as possible to counteract illegal activities and minimize the loss that banks, merchants and consumers suffer; on the other hand, the frequency of false alarms should be minimized to avoid generating adverse effects for cardholders and to limit the operational and process costs for the card issuers. The proposed approach allows us to score consumers’ transactions and to determine, in a rigorous, personalized and optimal manner, the threshold with which scores are compared to establish whether a purchase is fraudulent.


Introduction
Payment habits have changed dramatically over the past thirty years thanks to new technologies. Nowadays a growing number of purchases, including those of small amounts, are paid by credit and debit cards. While these electronic payment methods boost business and make the lives of buyers easier, they also heighten the fraud risk borne by the payments industry. As shown by the Nilson Report (2017), in 2016 worldwide fraud losses amounted to 7.15 cents per $100 of card transactions; the total card volume was $31.878 trillion, while fraud losses reached $22.80 billion, an amount that has been predicted to double by 2025. The total cost is even greater when the consequences of card fraud are considered: banks and card issuers invest in anti-fraud technologies and bear the losses incurred by their clients; merchants sustain high costs to guarantee their customers a high standard of security and can be charged back by card issuers if any negligence occurs during a transaction; finally, clients, often refunded by banks when victimized by fraud, are frustrated when their cards are blocked unnecessarily. Thus, counteracting credit card fraud is in the interest of all these actors.
Credit card fraud occurs whenever a credit card is used without the consent of its legitimate owner with the aim of either making purchases or stealing money. As a first level of defense against fraud, card issuers have developed authentication measures, such as the check of numerical codes (like the card PIN or the cardholder's zip code), signature and fingerprint verification systems and the 3D secure scheme (an authentication method that requires a cardholder either to enter a temporarily generated password to finalize her on-line transaction or to authorize the transaction over a second channel). However, since fraudsters dynamically adapt their strategies to the latest antifraud technologies, authentication measures may fail. Then, as a second level of defense, card issuers make use of detection measures to discriminate between legitimate and fraudulent transactions; currently employed fraud detection techniques are discussed in Section 2. Here, let us recall that these techniques are supervised methods, namely they are calibrated on a training sample of transactions which are labeled as legitimate or fraudulent: when a new transaction arrives, the trained model predicts its class. In particular, a suspicion score between 0 and 1 is assigned to each transaction, which is subsequently declared fraudulent when the score is higher than a certain threshold.
In the literature this threshold is determined empirically: for example, in Bhattacharyya et al. (2011); Mahmoudi and Duman (2015); Quah and Sriganesh (2008); Srivastava et al. (2008); Zaslavsky and Strizhak (2006) default values, such as 0.3 or 0.5, are used; in Jurgovsky et al. (2018), the threshold is fixed so that the training set is characterized by a predetermined true positive rate; in Carneiro et al. (2017), the threshold is such that either in the training set the false positive and the true positive rates are equal or a predetermined percentage of the top most rated transactions is labeled as illegal. In any case, to the best of the authors' knowledge, there is no formal theory which justifies these choices and in Carneiro et al. (2017) it is said that "it is critical to choose the score threshold for considering an order to be legitimate or fraudulent". This paper aims therefore at introducing a probabilistic model for the quickest detection of credit card fraud where for each transaction the posterior probability of being fraudulent is returned and a personalized threshold for each cardholder is optimally determined. The unobservable disorder or fraud time, at which the continuously monitored process of a consumer's transactions exhibits a disorder due to fraud, can be estimated as the first time the posterior probability process exceeds the threshold. This is the optimal stopping time, which minimizes the expected trade-off between the probability of having a false positive and the detection delay since the occurrence of the fraud.
The quickest detection problem of a change in the probabilistic features of a stochastic process has been widely studied. In Shiryaev (1978, Chap. 4.4) the early detection of a change in the drift of a Brownian motion was analyzed; this analysis was extended to a finite horizon formulation in Gapeev and Peskir (2006) and to other diffusion processes in Johnson and Peskir (2017); Gapeev and Shiryaev (2013). Partial solutions for the detection of a shift in the intensity of a Poisson process are in Davis (1976); Gal'chuk and Rozovskii (1971), while a complete solution was provided in Peskir and Shiryaev (2002). The latter is the basis of this article, where, according to Schmittlein et al. (1987), it is assumed that the observed process of a card user's expenditures follows a compound Poisson process, whose arrival times are the purchase times and whose jumps represent the corresponding amounts and the geographical coordinates. This process changes its intensity and jump distribution when hit by fraud, which we detect by resorting to the algorithm developed in Dayanik and Sezer (2006). We underline that this is the first time that optimal stopping theory (see, e.g., Peskir and Shiryaev (2006); Shiryaev (1978)) and the results of the aforementioned articles are applied to credit card fraud detection. Further results on the quickest detection for compound Poisson processes were obtained in Bayraktar and Dayanik (2006); Bayraktar et al. (2005); Buonaguidi and Muliere (2015); Gapeev (2005); Herberts and Jensen (2004).
The article is organized as follows. In Section 2 we briefly recall the data mining techniques currently used in credit card fraud detection. In Section 3 we introduce three Bayesian quickest detection models and we describe how we can adapt them to credit card fraud detection. We analyze the optimal strategy to raise the alarm of fraud and we see that it is the first time a functional of the observed cardholder's expenditures pattern, known as posterior probability process or, equivalently, (generalized) odds process, exceeds a threshold; we also describe the algorithm that can be used to compute it. In Section 4, using real credit card transactions provided by one leading company in the Swiss credit card market, we estimate the pre and post-fraud expenditure distribution parameters of the cardholders and these values will be used to compute their optimal thresholds. Then, we assess how our methodology performs in classifying new legitimate and fraudulent credit card transactions (both simulated and real); performance of our models will be derived, discussed and compared with that of other methodologies. Section 5 contains a summary discussion and concluding remarks.

Literature review on credit card fraud detection
Data mining refers to the discovery of patterns and relationships in a huge amount of data and is widely used to screen credit card transactions to detect fraud. Detection must work in real time: when the details of a transaction are received by a credit card issuer, the latter must decide within a few milliseconds if the transaction must be authorized or not. This step is of key importance, because approving a fraudulent operation implies the loss of the corresponding amount; on the other side, rejecting a legal purchase creates disturbances for a cardholder. In this section we recall some of the most popular data mining methodologies employed in credit card fraud detection; literature reviews are also presented in Bolton and Hand (2002); Ngai et al. (2011).
Logistic regression and rule-based methods are well known and among the first techniques employed in fraud detection due to their simplicity. Logistic regression is just a special case of the generalized linear model; in rule-based methods, rules are either established by experts on the basis of prior analysis or extracted from decision trees. Applications to credit card fraud of logistic regression and rule-based methods can be found in Bahnsen et al. (2016); Bhattacharyya et al. (2011);Carneiro et al. (2017); Yeh and Lien (2009) and Bahnsen et al. (2016); Mahmoudi and Duman (2015); Yeh and Lien (2009), respectively; we also refer to Letham et al. (2015) for a Bayesian approach to rule-based methods.
Ensemble methods are used to improve the classification accuracy. They are made up of an aggregation of different classification models: a training set is used to create training subsets and on each of them a model is calibrated. When a new transaction arrives, each model returns a class prediction, which is used together with the predictions of the other models to determine the class of the ensemble. Boosting and random forests are two examples of ensemble methods. In boosting, used in Chan et al. (1999), the models are trained sequentially: the transactions which have not been correctly classified in the previous model are weighted more in the next one, in order to give more importance to the misclassified cases. The ensemble class is a weighted average of each model class, whose weight depends on how well the model performed. A Bayesian version of boosting, known as BART (Bayesian Additive Regression Trees), was proposed in Chipman et al. (2010). Random forests are an ensemble of decision trees built on sub-samples randomly drawn with replacement from the original training set. The class that a random forest assigns to a new transaction is the mode of the classes predicted by the single decision trees. Random forests were applied in Bahnsen et al. (2016); Bhattacharyya et al. (2011);Carneiro et al. (2017) and used in a Bayesian inference setting in Raynal et al. (2019). Unlike the rule-based and decision tree methods, ensemble methods are less prone to over-fitting but their interpretation is more complex.
A hidden Markov model is a stochastic process with two hierarchical levels: the inner one is represented by a finite number of states and is hidden, namely not observable, while the outer one is the observable outcome generated in correspondence to a given state. Probabilities governing the transition among states and probabilities with which outcomes are generated are the model parameters. Hidden Markov models were used for credit card fraud detection in Srivastava et al. (2008): for each purchase type (the hidden state), the price range (low, medium and high) is observed. The model works as follows: after the parameters estimation, a new transaction arrives and its price range is passed to the model; the latter computes the probability that the transaction is characterized by the observed price range. When this probability is too low, the transaction deviates from the normal behavior and is therefore identified as fraudulent. This methodology has a nice probabilistic interpretation, but its structure (number of states and number of outcomes for each state) needs to be carefully adapted. We refer to Ko et al. (2015) for a Bayesian approach to hidden Markov models in change-point problems.
Support vector machines are techniques used to separate data. Data can be either linearly separable, when there exists a hyperplane that divides all the data of one class from the data of the other class, or linearly inseparable, when such a hyperplane does not exist. However, in the latter case, using a non linear mapping, the original data can be mapped to a higher dimensional space where the transformed data becomes linearly separable. Support vector machines aim at searching for the best hyperplane separating the training data, namely the hyperplane with maximal margin of separation between the edges (the so called support vectors) of the two classes. Once a support vector machine has been trained on a set of credit card transactions, a new transaction is classified as fraudulent or legitimate depending on which of the two portions of the space, determined by the estimated hyperplane, the explanatory variables lie. Limits of this methodology are the choice of the function that maps linearly inseparable data to linearly separable ones and the specificity of this function to the addressed problem. Support vector machines were investigated in Bhattacharyya et al. (2011);Carneiro et al. (2017); Mahmoudi and Duman (2015) and in Polson et al. (2015) their Bayesian version was provided.
Neural networks have become very popular among card issuers thanks to their ability to extract solutions from highly involved problems. A neural network is always characterized by a set of nodes, or neurons, connections among the neurons and a function which weights these connections. Neurons are placed in one or more layers: each neuron of one layer receives inputs from the neurons of the previous layer and combines this information with the weights of its connections. This operation propagates up to the neurons of the last layer, which return the final output. A set of labeled credit card transactions is used to train the model: an error function measures the distance between the output of the model and the true output. As this distance function depends only on the weights of the connections among neurons, the weights minimizing the error are searched by means of optimization algorithms. Despite their good performance, neural networks have some drawbacks: they are a "black box", in the sense that the function that they aim at optimizing cannot be inferred from the network structure; their topology (number of neurons and layers) strongly depends on the specific problem to be addressed; optimization algorithms do not always converge to the optimal set of weights that minimize the error function (see, e.g., Bishop (2006, Secs. 5.2.1 and 5.5) and Hastie et al. (2009, Secs. 10.7 and 11.5.4-11.5.5)). Neural networks were applied in Dorronsoro et al. (1997); Jurgovsky et al. (2018); Mahmoudi and Duman (2015); Quah and Sriganesh (2008); Yeh and Lien (2009);Zaslavsky and Strizhak (2006). A Bayesian perspective on neural networks and on their connection to statistical data reduction techniques was given in Polson and Sokolov (2017); methods of Bayesian optimization for hyperparameter selection in neural networks (as well as in logistic regression and support vector machines) were studied in Snoek et al. (2012).

Methodology description
A common feature of the methodologies analyzed in Section 2 is that they return a suspicion score in [0,1] measuring how likely a transaction is to be fraudulent. A transaction is then labeled as fraudulent if the score exceeds a fixed threshold. However, there is no theory which explains how to compute this threshold optimally and, as already said in Section 1, it is usually fixed empirically. In the following sections, we introduce our model, which provides a rigorous and personalized method to determine a threshold and the associated optimal strategy for each cardholder, so that the trade-off between the losses from detecting fraud too early and too late is minimized.

The model
We describe our model following the lines of Peskir and Shiryaev (2002). On the measurable space $(\Omega, \mathcal{F})$ the random variable $\theta$ is defined with respect to a family of probability measures $(P_s)_{s \ge 0}$ such that $P_s(\theta = s) = 1$. Here $\theta$ represents the so-called fraud or disorder time, at which the expenditure pattern $X := (X_t)_{t \ge 0}$ of a cardholder changes its statistical features due to fraud. According to the hypothesis in Schmittlein et al. (1987) (see also Glady et al., 2009), we assume that $X$ is a compound Poisson process:

$$X_t = \sum_{j=1}^{N_t} Y_j, \qquad t \ge 0. \tag{3.1}$$

In the expression above, $N := (N_t)_{t \ge 0}$ is a standard Poisson process, which models the purchase times, and $\{Y_j\}_{j \ge 1}$ is a sequence of independent and identically distributed $\mathbb{R}^d$-valued random variables, representing, for example, the amounts of the transactions and their geographical coordinates. At the disorder time $\theta$, $N$ changes its arrival rate from $\lambda_0$ to $\lambda_1$ and the $\{Y_j\}_{j \ge 1}$ switch their common distribution from $v_0(\cdot)$ to $v_1(\cdot)$. It is assumed that $\lambda_i > 0$ and $v_i(\cdot)$, defined on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, are all known, $i = 0, 1$, and that $v_1(\cdot)$ is absolutely continuous with respect to $v_0(\cdot)$.
Next, we define the probability measure

$$P_\pi := \pi P_0 + (1 - \pi) \int_0^\infty \lambda e^{-\lambda s} P_s \, ds, \qquad \pi \in [0, 1], \tag{3.2}$$

where $\pi$ and $\lambda$ are given and fixed. Expression (3.2) contains our prior belief about $\theta$, which, under $P_\pi$, takes value 0 with probability $\pi$ and, with probability $1 - \pi$, is exponentially distributed with mean $1/\lambda$. Since fraud is not directly observable, the best we can do is to detect $\theta$ through a strategy based on the continuous monitoring of $X$. Let $\mathcal{F}^X_t \subset \mathcal{F}$ be the sigma-algebra generated by $X$ up to $t$; then our goal is to determine a stopping time $\tau$ with respect to $\mathcal{F}^X = (\mathcal{F}^X_t)_{t \ge 0}$ which is as close as possible to $\theta$. Formally, the Bayesian quickest detection problem aims at computing one of the following risk functions

$$V_1(\pi) := \inf_\tau \left\{ P_\pi(\tau < \theta) + c_1 E_\pi\left[(\tau - \theta)^+\right] \right\}, \tag{3.3}$$

$$V_2(\pi) := \inf_\tau \left\{ E_\pi\left[(\theta - \tau)^+\right] + c_2 E_\pi\left[(\tau - \theta)^+\right] \right\}, \tag{3.4}$$

$$V_3(\pi) := \inf_\tau \left\{ P_\pi(\tau < \theta) + c_3 E_\pi\left[e^{\alpha (\tau - \theta)^+} - 1\right] \right\}, \tag{3.5}$$

and obtaining the optimal stopping time at which the infimum on the right-hand side of (3.3)-(3.5) is achieved. In the above expressions, $c_i$, $i = 1, 2, 3$, and $\alpha$ are positive and given values and $(x)^+ := \max\{0, x\}$. In (3.3) the probability of a false positive (stopping before the fraud time) and the expected linear detection delay are combined: the longer the observation of $X$, the lower the probability of raising a false alarm, but the higher the delay in detecting $\theta$; moreover, $c_1$ weights the importance assigned to these two sources of cost. Analogous interpretations hold for (3.4), where the expected advance in detecting $\theta$ replaces the probability of a false alarm, and for (3.5), where the expected exponential detection delay is considered (see, e.g., Beibel, 2000; Poor, 1998), with $\alpha$ being the internal rate of return at which the losses due to a detection delay are compounded. It will soon be evident that the structure of the solutions to (3.3)-(3.5) is similar, as already noticed in Bayraktar et al. (2005); Davis (1976).
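To make the prior on the fraud time and the change in arrival rate concrete, the following is a minimal simulation sketch. It only covers the purchase times (the marks Y_j are omitted), and the function names and parameter values are hypothetical choices made for illustration, not part of the original model description.

```python
import random


def sample_fraud_time(pi, lam, rng):
    """Draw theta from the prior: 0 with probability pi,
    otherwise exponentially distributed with mean 1/lam."""
    if rng.random() < pi:
        return 0.0
    return rng.expovariate(lam)


def simulate_purchase_times(theta, lam0, lam1, horizon, rng):
    """Purchase times on [0, horizon]: rate lam0 before theta, lam1 after."""
    times = []
    t = 0.0
    while True:
        rate = lam0 if t < theta else lam1
        t_next = t + rng.expovariate(rate)
        if t < theta <= t_next:
            # The candidate arrival falls after the fraud time: by the
            # memoryless property, restart the clock at theta with rate lam1.
            t = theta
            continue
        if t_next > horizon:
            return times
        times.append(t_next)
        t = t_next


rng = random.Random(0)
theta = sample_fraud_time(pi=0.0, lam=1 / 365, rng=rng)
purchases = simulate_purchase_times(theta, lam0=0.5, lam1=3.0, horizon=30.0, rng=rng)
```

Restarting the clock at theta, rather than rescaling the remaining waiting time, keeps the sketch short while still producing the intended piecewise-constant intensity.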

The optimal stopping problem
We are going to see that problems (3.3)-(3.5) can be reduced to an equivalent optimal stopping problem for a one-dimensional Markov process. Let us introduce the processes $\Pi := (\pi_t)_{t \ge 0}$, $\phi := (\phi_t)_{t \ge 0}$ and $\Phi_\alpha := (\Phi_{\alpha,t})_{t \ge 0}$ defined by

$$\pi_t := P_\pi(\theta \le t \mid \mathcal{F}^X_t), \qquad \phi_t := \frac{\pi_t}{1 - \pi_t}, \qquad \Phi_{\alpha,t} := \frac{E_\pi\left[e^{\alpha (t - \theta)} 1_{\{\theta \le t\}} \mid \mathcal{F}^X_t\right]}{1 - \pi_t}. \tag{3.6}$$

Since $\pi_t$ is the posterior probability that fraud has already occurred at time $t$ given all the past history of $X$, $\Pi$ is called the posterior probability process; $\phi$ is known as the odds process because $\phi_t$ is the odds of $\pi_t$. $\Phi_\alpha$ is the generalized odds process because $\Phi_\alpha = \phi$ when $\alpha = 0$; for this reason, in the sequel we will refer to $\Phi_\alpha$ only. Resorting to stochastic calculus and standard arguments based on Dayanik and Sezer (2006); Peskir and Shiryaev (2002), it is easy to show that $\Phi_\alpha$ satisfies the stochastic differential equation (3.7), in which $f(\cdot)$ in (3.8) denotes the Radon-Nikodym derivative of $v_1(\cdot)$ with respect to $v_0(\cdot)$. The dynamics in (3.7) show that $\Phi_\alpha$ is a strong Markov process. Adapting the results in Johnson and Peskir (2017) to $\Phi_\alpha$, it is possible to show that problems (3.3)-(3.5) are equivalent to the optimal stopping problem (3.9) with value function $U_i$, where $\alpha = 0$ when $i = 1, 2$ and the constants $k_i$ are given in (3.10). The infimum in (3.9) is taken over the stopping times of $\Phi_\alpha$, which coincide with those of $X$, as evident from (3.7). Further, unlike (3.3)-(3.5), where the expectation is under $P_\pi$, in (3.9) the expectation is under $P^\infty_\varphi$, the probability measure under which fraud never occurs, namely $\theta = \infty$, conditional on the event $\{\Phi_{\alpha,0} = \varphi\}$, for $\varphi \ge 0$. It is immediate to see that $U_i \le 0$, because $\tau = 0$ is an admissible stopping time, and $U_i \ge -k_i/\lambda$, which arises from never stopping (i.e., $\tau = \infty$) and the fact that $\Phi_\alpha$ takes positive values. We also see that it is never optimal to stop before $\Phi_\alpha$ reaches $k_i$, because, before that moment, the integrand in (3.9) remains negative.
Indeed, it is well known that there exists a threshold $B_i \ge k_i$ such that the optimal stopping time in (3.9) is given by

$$\tau_{B_i} := \inf\{t \ge 0 : \Phi_{\alpha,t} \ge B_i\}, \tag{3.11}$$

which is the first moment at which $\Phi_\alpha$ exceeds $B_i$ (see Bayraktar et al., 2005; Buonaguidi and Muliere, 2015; Gapeev, 2005; Gapeev and Shiryaev, 2013; Johnson and Peskir, 2017; Peskir and Shiryaev, 2002; Shiryaev, 1978). From (3.11) we observe that the optimal threshold is independent of $\pi$, the prior probability that fraud occurs immediately, and this is consistent with the general optimal stopping theory (Peskir and Shiryaev, 2006; Shiryaev, 1978). Further, from (3.6) and the fact that $\alpha = 0$ when $i = 1, 2$, we have that (3.11) is equivalent to

$$\tau_{B_i} = \inf\left\{t \ge 0 : \pi_t \ge \frac{B_i}{1 + B_i}\right\}, \qquad i = 1, 2.$$
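When α = 0 the generalized odds coincide with the ordinary odds of the posterior probability, so a threshold on the odds scale can be restated on the probability scale. The short sketch below illustrates this conversion; the function names and the numerical threshold are hypothetical.

```python
def odds_to_probability(odds):
    """Map odds phi = pi / (1 - pi) back to the probability pi."""
    return odds / (1.0 + odds)


def probability_to_odds(p):
    """Map a probability pi in [0, 1) to its odds pi / (1 - pi)."""
    return p / (1.0 - p)


# Hypothetical odds threshold B: the alarm condition phi_t >= B is the
# same event as pi_t >= B / (1 + B), because the map is increasing.
B = 4.0
probability_threshold = odds_to_probability(B)
```

Because the map from odds to probability is strictly increasing, the first time the odds reach B is also the first time the posterior probability reaches B / (1 + B).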

The algorithm
Solving the Bayesian quickest detection problems (3.3)-(3.5) boils down to computing the function U i in (3.9) and the threshold B i in (3.11), i = 1, 2, 3 (for a simpler notation, the index i will be omitted). This can be done by resorting to the iterative procedure provided in Dayanik and Sezer (2006).
When a credit card transaction is made, $X$ from (3.1) has a jump. Denoting by $\{\sigma_n\}_{n \ge 1}$ the jump times of $X$, let us notice that they coincide with the jump times of $\Phi_\alpha$, as the second addend in (3.7) shows. In particular, Equation (3.7) also shows that $\Phi_\alpha$ solves, between two successive jumps, a first-order linear differential equation for $t \in [\sigma_n, \sigma_{n+1})$. $\Phi_\alpha$ is therefore a piecewise deterministic Markov process: between any two subsequent jumps it follows the deterministic flow $t \mapsto x(t, \varphi)$, where $\varphi \ge 0$ is the starting point of the process after a jump (see, e.g., Davis, 1993).
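The piecewise deterministic behaviour can be illustrated with a small filter that alternates between the deterministic flow and multiplicative jumps. This is only a sketch under explicit assumptions: it takes the between-jump dynamics to be dx/dt = λ + a·x with a = λ + α − λ1 + λ0, and the jump at a purchase to multiply the odds by (λ1/λ0)·f(Y), which is the standard form for odds processes in Poisson disorder problems and is assumed here rather than quoted from this article. The function names are hypothetical, and the alarm is checked only at purchase times, whereas the continuous-time process may also cross the threshold between purchases.

```python
import math


def flow(t, phi, lam, a):
    """Assumed deterministic path x(t, phi): solves dx/dt = lam + a * x, x(0) = phi."""
    if abs(a) < 1e-12:
        return phi + lam * t
    return (phi + lam / a) * math.exp(a * t) - lam / a


def first_alarm(gaps, ratios, phi0, lam, lam0, lam1, alpha, threshold):
    """Return (index, odds) at the first purchase where the odds reach the threshold.

    gaps[k]   : time elapsed before purchase k
    ratios[k] : density ratio f(Y_k) of the k-th mark (use 1.0 when no marks are observed)
    """
    a = lam + alpha - lam1 + lam0  # assumed drift coefficient
    phi = phi0
    for k, (gap, ratio) in enumerate(zip(gaps, ratios)):
        phi = flow(gap, phi, lam, a)      # deterministic evolution between purchases
        phi *= (lam1 / lam0) * ratio      # assumed multiplicative jump at the purchase
        if phi >= threshold:
            return k, phi
    return None, phi
```

For example, with `lam=1.0, lam0=2.0, lam1=3.0, alpha=0.0` the coefficient `a` vanishes, the flow is linear in time, and each purchase multiplies the odds by 1.5.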
Let us consider the family of optimal stopping problems $U^{(n)}$, $n \ge 1$, where the infimum is taken with respect to the stopping times of $\Phi_\alpha$ and we integrate up to the minimum between a stopping time $\tau$ and $\sigma_n$. In order to exploit the piecewise deterministic Markov property of $\Phi_\alpha$, let us define the operators $J$ in (3.18)-(3.19), acting on the set of bounded and continuous functions on $[0, \infty)$. Then, we can compute sequentially the functions $u^{(n)}$ that satisfy the property $u^{(n)} = U^{(n)}$, $n \ge 1$, and $\lim_{n \to \infty} u^{(n)} = U$. Observing that $\sigma_1$ has exponential distribution with mean $1/\lambda_0$ under $P^\infty$ and using Fubini's theorem, (3.18)-(3.19) can be written more explicitly as in (3.20)-(3.21). Finally, the threshold $B$ in (3.11) is obtained from the limit function $U$. In practice, the computations end when $n$ is sufficiently large. The technical details on the implementation of the illustrated approximation scheme can be found in Dayanik and Sezer (2006, Sec. 5) and are also reported in the Supplementary Material (Buonaguidi et al., 2020a, 2020b).

Experimental setup
In this section, we calibrate the quickest detection models in (3.3)-(3.5) on a real set of credit card transactions and we test them on simulated and real datasets. Common performance measures will be computed to evaluate the predictive power of the proposed methodology.

The dataset
One of the most important Swiss credit card issuers, with more than 1.5 million issued cards and more than 100 million transactions authorized every year, provided us with a vast dataset of real credit card transactions, including Internet purchases. This dataset covers a six-month period, from June to November 2016, and contains the details of 124,770 authorized transactions, which pertain to 4,077 different cardholders. Each transaction has the following attributes: BaseCardID, the identification code of a cardholder, which remained the same even if she replaced her card during the considered period (cardholders had been completely anonymized); RecordDateTime, the date and the time at which the operation took place; TrxAmount, the transaction amount in the currency of the issuer; MerchantLocation, the location of the merchant; isTrxFraud, a flag indicating whether the transaction was fraudulent. The latter attribute is created by the card issuer a few days after the transaction, which is identified as fraudulent by means of the analysis of fraud experts and the confirmation of the cardholders: when the fraud team manually revises suspicious transactions, the additional information to which the team members have access, such as the merchants identification number and the merchants category (restaurant, pharmacy, ATM, etc), allows them to identify illegal purchases reliably; then, the legitimate cardholders are contacted for their confirmation on the fraudulent nature of the transactions. In the dataset 2,778 transactions were labeled as fraudulent, implying a fraud ratio of 2.23%. Actual fraud ratios are much lower than this value; however, in our dataset, fraudulent transactions have been overweighted to mitigate the problem of data skewness, occurring when the legitimate cases far outnumber the fraudulent ones.
By means of the BaseCardID, transactions were grouped and sorted in ascending order by the RecordDateTime. A new attribute ElapsedDays was obtained as the number of days between two consecutive operations; for the first transaction of each cardholder, ElapsedDays was set equal to 0. This attribute has been derived because when a fraudster steals a credit card or the associated sensitive information, he usually attempts to make as many transactions as possible in a narrow window of time, before the fraud is detected and the card is blocked. Similarly, the variable TrxAmount has a key role in fraud detection, because fraudsters try to maximize spending before they are discovered, as suggested in Bhattacharyya et al. (2011); Bolton and Hand (2002). The importance of the time elapsed between transactions and of their amounts was also underlined in Carneiro et al. (2017, Table 6). Finally, the MerchantLocation attribute has been used to derive the geographical coordinates of the associated transactions: Latitude and Longitude. Indeed, fraudsters often perpetrate their activities in places that differ from the ones where cardholders make their legitimate purchases. Knowing where a transaction took place is therefore relevant for a more efficient identification of fraudulent behaviors. Table 1 shows the final structure of our dataset.
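The derivation of ElapsedDays described above can be sketched as follows. The attribute names are those of the dataset, while the record layout (a list of dictionaries with `datetime` values) and the use of fractional days are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def add_elapsed_days(transactions):
    """Group rows by BaseCardID, sort by RecordDateTime and add ElapsedDays.

    ElapsedDays is 0 for the first transaction of each cardholder and the
    (fractional, by assumption) number of days since the previous one otherwise.
    """
    by_card = defaultdict(list)
    for row in transactions:
        by_card[row["BaseCardID"]].append(row)

    result = []
    for card_id in sorted(by_card):
        rows = sorted(by_card[card_id], key=lambda r: r["RecordDateTime"])
        previous = None
        for row in rows:
            if previous is None:
                elapsed = 0.0
            else:
                delta = row["RecordDateTime"] - previous
                elapsed = delta.total_seconds() / 86400.0
            result.append({**row, "ElapsedDays": elapsed})
            previous = row["RecordDateTime"]
    return result
```

The input rows are left unmodified; each output row is a copy extended with the new attribute.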

Models calibration
Calibrating the Bayesian quickest detection models (3.3)-(3.5) means (i) establishing the parameter $\lambda$ in (3.2), governing the prior distribution of the fraud time $\theta$, (ii) determining the constants $c_i$, $i = 1, 2, 3$, and $\alpha$ in (3.3)-(3.5), (iii) estimating the quantities $(\lambda_0, v_0(\cdot))$ and $(\lambda_1, v_1(\cdot))$ for the cardholder's expenditure process (3.1) and (iv) computing the optimal threshold $B_i$, $i = 1, 2, 3$, in (3.11). For point (i), we relied on fraud experts' prior knowledge for a reasonable value of $\lambda$. In (ii), the values $c_i$, $i = 1, 2, 3$, are chosen by the user of the models, who, according to her needs, may decide to weigh the detection delay more or less heavily; $\alpha$ is also chosen by the user on the basis of the interest rate at which the losses due to detection delays are compounded. For the quantities in (iii), we adopted the following approach: for each cardholder, her legitimate transactions (the ones where the attribute isTrxFraud takes value 0) are used to estimate her own arrival rate $\lambda_0$ and jump distribution $v_0(\cdot)$; $\lambda_1$ and $v_1(\cdot)$ are estimated only once on all the fraudulent transactions in the dataset, meaning that we assume the existence of a representative fraudster who may potentially act against all the cardholders. The latter assumption is motivated by the fact that fraudulent transaction data do not discriminate among different fraudsters and fraud represents a tiny percentage of the total number of purchases: hence, we need to aggregate all the available information to reliably estimate the behavior of at least one representative fraudster. More details on the estimation procedure under different "information schemes" will be given in this section. Finally, the personalized optimal threshold at point (iv) depends on the quantities in (i)-(iii), as shown by (3.11), and is determined for each cardholder through the algorithm of Section 3.3.

Observing the elapsed days only
When the date and the time of a transaction are the only features that we can observe and, consequently, only the ElapsedDays attribute in Table 1 is available, the cardholder's expenditure process $X$ in (3.1) becomes a simple Poisson process. This formally arises from setting $Y_n = 1$ $P_s$-a.s., $n \ge 1$ and $s \ge 0$, so that $v_i(dx) = \delta_1(x)\,dx$, $i = 0, 1$, where $\delta_1(\cdot)$ is the Dirac measure; it implies that $f(x) = 1_{\{1\}}(x)$ in (3.8).
Let $K_1$ be the set of indexes relative to the subsequent fraudulent transactions in the derived dataset and, similarly, let $K_0$ be the set of indexes associated with the consecutive legitimate purchases of a given cardholder. Then, since the inter-arrival times of a Poisson process are independent and follow an exponential distribution with mean $1/\lambda_0$ and $1/\lambda_1$, for the legitimate and fraudulent cases respectively, $\lambda_i$ can be determined as the maximum likelihood estimate

$$\hat{\lambda}_i = \frac{|K_i|}{\sum_{k \in K_i} \text{ElapsedDays}_k}, \qquad i = 0, 1, \tag{4.1}$$

where $|K_i|$ is the cardinality of the set $K_i$. We obtained $\hat{\lambda}_1 = 3.012032$, meaning that when a fraudster steals a credit card, he tries to make about three transactions per day on average; the maximum likelihood estimates of $\lambda_0$ for the cardholders in the dataset range in the interval [0.005769, 2.776012]. Hence, legitimate transactions occur less frequently than fraudulent ones. This fact can also be inferred from Figure 1, where the cumulative distribution function of the elapsed days between all the pairs of consecutive fraudulent transactions of our dataset is compared with that of a sample of subsequent legitimate purchases.

Figure 1: Comparison between the cumulative distribution function of the elapsed days between two consecutive fraudulent transactions (in red) and the one of the elapsed days in a sample of 2,000 subsequent legitimate purchases (in green).
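The maximum likelihood estimate of an exponential rate is the number of observed inter-arrival times divided by their sum. A minimal sketch, with a hypothetical function name and made-up sample values:

```python
def estimate_rate(elapsed_days):
    """Maximum likelihood estimate of a Poisson arrival rate (events per day)
    from the elapsed days between consecutive transactions."""
    total = sum(elapsed_days)
    if not elapsed_days or total <= 0:
        raise ValueError("need at least one positive inter-arrival time")
    return len(elapsed_days) / total


# Made-up inter-arrival times (in days) between consecutive transactions.
rate = estimate_rate([0.5, 0.25, 0.25])
```

In this toy sample three gaps sum to one day, so the estimated rate is three transactions per day.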

Observing the elapsed days and the transaction amounts
When the transaction amounts are also available, in the compound Poisson process (3.1) the sequence of random variables $\{Y_j\}_{j \ge 1}$ can be used to model the purchase expenditures. Letting $\{Y_j\}_{j \ge 1}$ represent the logarithm of the variable TrxAmount, we assumed that, before and after fraud, $Y_j$ follows a Gaussian distribution with mean and variance $\mu_i$ and $\sigma_i^2$, $i = 0, 1$, respectively. Then,

$$v_i(dx) = N_1(x; \mu_i, \sigma_i^2)\,dx, \qquad i = 0, 1, \tag{4.2}$$

where $N_1(x; \mu_i, \sigma_i^2)$ is the univariate Gaussian density with mean and variance $\mu_i$ and $\sigma_i^2$ evaluated at $x \in \mathbb{R}$. Denoting by $H_1$ the set of indexes of all the fraudulent transactions and by $H_0$ the set of indexes of all the legitimate purchases for a given cardholder (let us observe that $K_i \subseteq H_i$, $i = 0, 1$), $\mu_i$ and $\sigma_i^2$ can be computed by resorting to the maximum likelihood estimators

$$\hat{\mu}_i = \frac{1}{|H_i|} \sum_{k \in H_i} \log(\text{TrxAmount}_k), \qquad \hat{\sigma}_i^2 = \frac{1}{|H_i|} \sum_{k \in H_i} \left(\log(\text{TrxAmount}_k) - \hat{\mu}_i\right)^2, \qquad i = 0, 1. \tag{4.3}$$

For the fraudulent transactions, we obtained $\hat{\mu}_1 = 4.095233$ and $\hat{\sigma}_1^2 = 3.124095$. The left panel of Figure 2 reports the histogram of the logarithmic amounts of the fraudulent transactions and compares it to the estimated Gaussian density. The right panel plots the estimated pairs $(\mu_0, \sigma_0)$ characterizing the Gaussian distribution of the logarithmic legitimate amounts of all the cardholders. The intensities $\lambda_0$ and $\lambda_1$ are estimated according to (4.1).
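As an illustration of this information scheme, the sketch below fits the Gaussian parameters of the log-amounts by maximum likelihood and evaluates the ratio of the post-fraud to the pre-fraud density at a new amount. Function names are hypothetical, and any inputs are placeholders rather than values from the dataset.

```python
import math


def fit_log_amounts(amounts):
    """Maximum likelihood mean and variance of the logarithmic amounts."""
    logs = [math.log(a) for a in amounts]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((x - mu) ** 2 for x in logs) / n  # MLE divides by n, not n - 1
    return mu, var


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def amount_density_ratio(amount, mu0, var0, mu1, var1):
    """Ratio of the post-fraud to the pre-fraud density at log(amount)."""
    x = math.log(amount)
    return gaussian_pdf(x, mu1, var1) / gaussian_pdf(x, mu0, var0)
```

A ratio above 1 means the amount is more typical of the representative fraudster than of the cardholder, and a ratio below 1 means the opposite.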

Observing the elapsed days, the amounts and the geographical coordinates
The last and fully informative scheme we consider is the one where the geographical coordinates are also available. The cardholder's expenditure process $X$ in (3.1) now becomes multivariate, since the random variables $\{Y_j\}_{j \ge 1}$ are used to model the logarithm of the amounts, the longitude and the latitude of the transactions and therefore take values in $\mathbb{R}^3$. Since consumers make the majority of their purchases in a few selected places, we decided to model the coordinates through mixtures of Gaussian distributions. Amounts are assumed to be independent of the coordinates; then, we extend (4.2) to

$$v_i(dx, dy) = N_1(x; \mu_i, \sigma_i^2) \left[ \sum_{j=1}^{n_i} p_{i,j} N_2(y; \eta_{i,j}, \Sigma_{i,j}) \right] dx\,dy, \qquad i = 0, 1, \tag{4.4}$$

where $N_2(y; \eta_{i,j}, \Sigma_{i,j})$ is the bivariate Gaussian density of the $j$-th mixture component with mean vector and covariance matrix $\eta_{i,j}$ and $\Sigma_{i,j}$, respectively. In (4.4), $n_i$ represents the number of components, the so-called clusters, of the mixture and $p_{i,j}$ is the probability that an element belongs to component $j$, and is such that $\sum_{j=1}^{n_i} p_{i,j} = 1$. From the expression above, we easily find that the Radon-Nikodym derivative $f(\cdot)$ in (3.8) takes the form

$$f(x, y) = \frac{N_1(x; \mu_1, \sigma_1^2) \sum_{j=1}^{n_1} p_{1,j} N_2(y; \eta_{1,j}, \Sigma_{1,j})}{N_1(x; \mu_0, \sigma_0^2) \sum_{j=1}^{n_0} p_{0,j} N_2(y; \eta_{0,j}, \Sigma_{0,j})}.$$

The maximum likelihood estimators of $\lambda_i$, $\mu_i$ and $\sigma_i^2$, $i = 0, 1$, are still given by (4.1) and (4.3). Once the numbers of mixture components, $n_0$ and $n_1$, are chosen, $\eta_{i,j}$, $\Sigma_{i,j}$ and $p_{i,j}$, $j = 1, \ldots, n_i$, $i = 0, 1$, can be estimated by resorting to the EM-algorithm (Dempster et al., 1977). For the longitude and the latitude of the fraudulent purchases, we used a bivariate Gaussian mixture model with $n_1 = 6$ components, whose induced clusters are shown in Figure 3. For each cardholder, a bivariate Gaussian mixture model was estimated on the coordinates of her legitimate transactions. We initially fixed $n_0 = 3$ components; if the algorithm failed to converge (because, for example, transactions were concentrated in very few regions or in just one), $n_0$ was decreased to 2 or 1.
Figure 3: Clusters of the fraudulent transactions obtained by estimating a Gaussian mixture model with six components on their geographical coordinates. Cluster 1 (red) is centered in Australia and its weight is $p_{1,1} = 0.027$. Cluster 2 (yellow) embraces Asian countries with weight $p_{1,2} = 0.057$. Cluster 3 (green) comprises South American and South African countries and its weight is $p_{1,3} = 0.041$. Cluster 4 (light blue) includes central and south European countries, with weight $p_{1,4} = 0.202$. Cluster 5 (blue) mainly refers to the United States and has weight $p_{1,5} = 0.279$. Cluster 6 (violet) mainly refers to the United Kingdom and has weight $p_{1,6} = 0.393$.
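To make the structure of the full information scheme concrete, the sketch below evaluates the density of a single mark as the product of a univariate Gaussian density for the logarithmic amount and a bivariate Gaussian mixture density for the coordinates, under the independence assumption stated above. All function names and parameter values are hypothetical; in particular, the two-cluster cardholder is invented for illustration and the EM estimation step is not reproduced.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bivariate_normal_pdf(z, mean, cov):
    """Bivariate Gaussian density; cov is a 2x2 matrix given as nested tuples."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    # Quadratic form (z - mean)^T cov^{-1} (z - mean) for a 2x2 matrix.
    quad = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def mixture_pdf(z, weights, means, covs):
    """Density of a bivariate Gaussian mixture with the given weights."""
    return sum(p * bivariate_normal_pdf(z, m, c)
               for p, m, c in zip(weights, means, covs))


def joint_density(log_amount, coords, mu, var, weights, means, covs):
    """Amount and coordinates are independent, so the densities multiply."""
    return normal_pdf(log_amount, mu, var) * mixture_pdf(coords, weights, means, covs)


# Hypothetical cardholder with two purchase clusters (longitude, latitude).
weights = [0.7, 0.3]
means = [(9.19, 45.46), (12.50, 41.90)]
covs = [((0.02, 0.0), (0.0, 0.02)), ((0.05, 0.01), (0.01, 0.04))]
density = joint_density(math.log(35.0), (9.20, 45.47), mu=3.5, var=1.2,
                        weights=weights, means=means, covs=covs)
print(density)
```

Evaluating this density with the fraud parameters and with the cardholder parameters at the same observed mark gives the two quantities whose ratio enters the score.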

Some notes on the thresholds computation
We set $\lambda = 1/365$ in (3.2); that is, according to our prior belief, as elicited by experts, a cardholder suffers one fraud attempt per year on average. The algorithm of Section 3.3 for the computation of the optimal threshold in (3.11) was applied for each of the previous information schemes, for each of the models in (3.3)-(3.5) and for each cardholder. For each of the models (3.3)-(3.5), which we refer to as "linear", "expected miss" and "exponential", respectively, different values of $c_i$, $i = 1, 2, 3$, were used: in the linear problem (3.3), the cases $c_1 = 0.1$ and $c_1 = 0.2$ were considered; in the expected miss problem (3.4), we first set $c_2 = 10$ and then $c_2 = 50$; in the exponential case (3.5), we first considered $c_3 = 1{,}000$ and then $c_3 = 2{,}000$. In the exponential case, $\alpha$ was set equal to $1.3367 \times 10^{-4}$, which, with time measured in days, is equivalent to an annual internal rate of return of 5%. Information on these parameters is usually available to a credit card issuer and in any case can be obtained from the statistics of previous months or years.
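The two time-scale parameters above can be checked with a few lines of Python. The sketch assumes continuous compounding with time measured in days and a 365-day year, so that $e^{365\alpha} = 1.05$; the variable names are hypothetical.

```python
import math

DAYS_PER_YEAR = 365

# Prior rate of the exponential fraud time: one expected fraud attempt per year.
prior_rate = 1.0 / DAYS_PER_YEAR
mean_days_to_fraud = 1.0 / prior_rate

# Daily continuous rate equivalent to a 5% annual rate (assumption: exp(365 * alpha) = 1.05).
annual_rate = 0.05
alpha = math.log(1.0 + annual_rate) / DAYS_PER_YEAR

print(f"prior mean fraud time = {mean_days_to_fraud:.0f} days")
print(f"alpha = {alpha:.4e}")
```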
Let us finally make four remarks: (1) when $\{Y_j\}_{j \geq 1}$ contains more features than those considered in our analysis, assuming independence among some of these explanatory variables could ease the estimation of $\nu_0(\cdot)$ and $\nu_1(\cdot)$; (2) for a given model and cardholder, the complexity of the algorithm increases with the information scheme: when only the elapsed days are considered, the integral in (3.21) disappears (because $\nu_0(\cdot)$ concentrates all its mass on 1), so that the optimization problem in (3.20) can be quickly solved; instead, when the amounts or both the amounts and the coordinates are considered, the integral in (3.21) is one-dimensional or three-dimensional, respectively, and this slows down the solution of (3.20); (3) we wrote the code in Matlab and used a standard 2017 laptop for the computations: the estimation of $(\lambda_0, \nu_0(\cdot))$ took about 0.002, 0.002 and 0.009 seconds on average for each cardholder when the elapsed days, the elapsed days and the amounts, and the elapsed days, the amounts and the coordinates are observed, respectively. These times rose to 21, 146 and 456 seconds when the cardholder specific optimal threshold from (3.11) was also computed; (4) at first sight, the execution times just reported may seem long. However, they are manageable for two reasons: the algorithm of Section 3.3 applies independently to each cardholder and can therefore be parallelised across cardholders; moreover, in practice fraud models would be implemented in high-performance computing languages and are usually re-trained less than once a month.

Models testing on simulated transactions
In order to assess the performance of the models (3.3)-(3.5), 20 datasets of transactions were simulated. Each of them has the same structure reported in Table 1 and was generated in the following way: for each cardholder, 50 transactions were simulated and the flag indicating their legitimate or fraudulent nature was drawn from a Bernoulli distribution with parameter 0.1 (i.e., about 10% of the transactions in a dataset are fraudulent); the fraudulent transactions always occur after all the legitimate ones. According to the representation in (3.1), for any legitimate transaction of a cardholder, the attribute ElapsedDays was drawn from an exponential distribution with mean $1/\lambda_0$, the variable TrxAmounts was taken to be the exponential of a Gaussian random number with mean $\hat{\mu}_0$ and variance $\hat{\sigma}_0^2$, and the Longitude and Latitude attributes were generated from a mixture of bivariate Gaussian densities. The parameters characterizing all these distributions are cardholder specific and were obtained during the calibration step, as discussed in Section 4.2. For all the fraudulent transactions, the ElapsedDays variable was drawn from an exponential distribution with mean $1/\lambda_1$, the logarithm of TrxAmounts was simulated from a Gaussian density with mean $\hat{\mu}_1$ and variance $\hat{\sigma}_1^2$, and the Longitude and Latitude variables were simulated according to the mixture of bivariate Gaussian densities of Figure 3; let us recall that these fraudulent distributions are not cardholder specific.
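The simulation scheme just described can be sketched as follows for one cardholder. This is a minimal illustration, not the authors' Matlab code: the coordinates are omitted for brevity, the function and key names are hypothetical, and the intensities and legitimate-amount parameters are invented; only the fraud amount parameters (4.095233 and 3.124095) come from the text.

```python
import math
import random


def simulate_cardholder(rng, lam0, mu0, var0, lam1, mu1, var1,
                        n_trx=50, fraud_prob=0.1):
    """Simulate n_trx transactions; fraudulent ones follow all legitimate ones."""
    # One Bernoulli(fraud_prob) draw per transaction decides how many are fraudulent.
    n_fraud = sum(rng.random() < fraud_prob for _ in range(n_trx))
    flags = [0] * (n_trx - n_fraud) + [1] * n_fraud
    transactions = []
    for is_fraud in flags:
        lam, mu, var = (lam1, mu1, var1) if is_fraud else (lam0, mu0, var0)
        elapsed_days = rng.expovariate(lam)               # exponential, mean 1/lam
        amount = math.exp(rng.gauss(mu, math.sqrt(var)))  # log-amount is Gaussian
        transactions.append({"elapsed_days": elapsed_days,
                             "amount": amount,
                             "is_fraud": is_fraud})
    return transactions


rng = random.Random(2016)
transactions = simulate_cardholder(rng, lam0=0.4, mu0=3.5, var0=1.0,
                                   lam1=2.0, mu1=4.095233, var1=3.124095)
print(sum(t["is_fraud"] for t in transactions), "fraudulent out of", len(transactions))
```

Passing an explicit `random.Random` instance keeps each simulated dataset reproducible, which is convenient when 20 datasets have to be regenerated.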

Scoring
In each of the 20 simulated datasets, we computed, for each cardholder and each information scheme, the dynamics given by (3.13)-(3.15) of the (generalized) odds process $\Phi^{\alpha}$ (recall that $\alpha = 0$ for the linear and expected miss models, where $\Phi = \Phi^{0}$). As shown in Sections 3.2-3.3, its dynamics only depend on $\pi$ from (3.2) (we always fixed $\pi = 0$), $\lambda$, $\alpha$, $(\lambda_0, \nu_0(\cdot))$ and $(\lambda_1, \nu_1(\cdot))$. All the transactions characterized by a value of $\Phi^{\alpha}$ greater than the cardholder specific optimal threshold were labeled as fraudulent, according to (3.11)-(3.12). Because of its optimality, the adopted detection strategy achieves the best trade-off between early, unjustified credit card blocks and late detection of fraudulent transactions; hence, in terms of the criteria (3.3)-(3.5), no other strategy performs better.
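The scoring step can be illustrated with the following Python sketch. Equations (3.13)-(3.15) are not reproduced in this section, so the sketch relies on an assumption: for $\alpha = 0$ it uses the standard form of the odds process of a compound Poisson disorder problem, which grows deterministically between transactions according to $d\Phi_t = (\lambda + a\Phi_t)\,dt$ with $a = \lambda - \lambda_1 + \lambda_0$, and is multiplied by $(\lambda_1/\lambda_0)$ times the ratio of the post-fraud to pre-fraud mark densities at each transaction. The function names, the threshold and all numerical inputs are hypothetical.

```python
import math


def odds_between_jumps(phi, dt, lam, lam0, lam1):
    """Deterministic evolution of the odds over dt days without transactions."""
    a = lam - lam1 + lam0
    if abs(a) < 1e-12:
        return phi + lam * dt
    return (phi + lam / a) * math.exp(a * dt) - lam / a


def odds_at_transaction(phi_before, density_ratio, lam0, lam1):
    """Jump of the odds when a transaction with the given density ratio arrives."""
    return (lam1 / lam0) * density_ratio * phi_before


def score_and_label(elapsed_days, density_ratios, threshold, lam, lam0, lam1, phi0=0.0):
    """Score each transaction and flag those whose odds exceed the threshold."""
    phi, scores, labels = phi0, [], []
    for dt, ratio in zip(elapsed_days, density_ratios):
        phi = odds_between_jumps(phi, dt, lam, lam0, lam1)
        phi = odds_at_transaction(phi, ratio, lam0, lam1)
        scores.append(phi)
        labels.append(phi > threshold)
    return scores, labels


# Hypothetical inputs: two transactions one day apart, the second far more
# likely under the fraud distribution (density ratio 50); threshold is invented.
scores, labels = score_and_label([1.0, 1.0], [1.0, 50.0], threshold=0.05,
                                 lam=1.0 / 365, lam0=0.5, lam1=2.0)
print(scores, labels)
```

When only the elapsed days are observed, the density ratio is identically 1, so the score is driven by the timing of the transactions alone. The starting value `phi0 = 0.0` corresponds to $\pi = 0$.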

Performance measures
By comparing the actual nature of a transaction (variable isTrxFraud in the simulated datasets) with the corresponding model prediction, performance measures commonly used in the literature were computed. As reported in Table 2, transactions correctly identified as fraudulent are called true positives, while those correctly classified as legitimate are called true negatives; we may also have false positives, when legitimate transactions are identified as fraudulent, and false negatives, when fraudulent transactions are predicted as legitimate.
Table 2

| | Predicted fraudulent | Predicted legitimate |
|---|---|---|
| Actual fraudulent | true positive | false negative |
| Actual legitimate | false positive | true negative |

Let us denote by TP, FN, TN and FP the number of true positives, false negatives, true negatives and false positives in a dataset. Then, we considered seven standard metrics: the accuracy (Acc), which is the proportion of correct predictions (it could be misleading because, for example, if all the transactions were predicted as legitimate in our datasets, where the percentage of fraud is about 0.1, the accuracy would be around 0.9); the false positive rate (FPR), also known as fall-out, which is the proportion of transactions predicted as fraudulent among the legitimate ones; the true positive rate (TPR), also called sensitivity or recall, which is the proportion of transactions predicted as fraudulent among the fraudulent ones; the negative predictive value (NPV), which is the proportion of actually legitimate transactions among those predicted as such; the precision (Pr), which is the proportion of actually fraudulent transactions among those predicted as such; the Matthews correlation coefficient (MCC), which represents the correlation between the actual and predicted nature of the transactions. Their expressions are reported for completeness in the Supplementary Material. We also derived the area under the ROC curve (AUC), where the ROC (receiver operating characteristic) curve is defined as the set of all the pairs (FPR, TPR) obtained by letting the cardholders' thresholds vary.
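For reference, the six count-based metrics can be computed from the confusion matrix as in the sketch below. It uses the standard textbook definitions, which are assumed to coincide with those in the Supplementary Material; the function name is hypothetical, and a metric whose denominator is zero is returned as `None`.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts; None when not computable."""
    def ratio(num, den):
        return num / den if den > 0 else None

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Acc": ratio(tp + tn, tp + fn + tn + fp),
        "FPR": ratio(fp, fp + tn),
        "TPR": ratio(tp, tp + fn),
        "NPV": ratio(tn, tn + fn),
        "Pr": ratio(tp, tp + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


# The misleading-accuracy example from the text: 10% fraud, everything
# predicted as legitimate (counts are illustrative).
metrics = classification_metrics(tp=0, fn=100, tn=900, fp=0)
print(metrics)
```

In this example the accuracy is 0.9 although no fraud is detected (TPR equal to 0), while the precision and the MCC are not computable because no transaction is predicted as fraudulent.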

Results
Table 3 shows the values of the metrics discussed above; they are also reported in Figure 4 for a better visualization. They are obtained as the average of the corresponding metrics computed on each of the 20 simulated datasets; the standard errors are reported in brackets. The first two blocks of Table 3 show the results for the linear model (3.3) when $c_1$ is 0.1 and 0.2, respectively, across the information schemes of Section 4.2. The abbreviations ED, TrxAm and Coo stand for elapsed days, transaction amounts and geographical coordinates. We see that the results improve as more attributes are considered: the FPR decreases and all the other metrics increase, as expected. Overall, the obtained performance measures are very satisfactory both in absolute terms and when compared to the literature: for example, in Bhattacharyya et al. (2011, Tables 6a-6c) the best values of the Acc, FPR, TPR, Pr and AUC are 0.996, 0.001, 0.812, 0.613 and 0.934, respectively; in Carneiro et al. (2017, [...]

Table 3: [...] expected miss model (3.4) with $c_2 = \{10, 50\}$, exponential model (3.5) with $c_3 = \{1{,}000, 2{,}000\}$ and $\alpha = 1.3367 \times 10^{-4}$, across different information schemes. For the third information scheme, the bold font is used to highlight the best value of each metric across the different models.
We may notice from Table 3 that an increase of the parameter $c_1$ implies, for a given information scheme, an increase of the FPR, TPR and NPV and a decrease of the Pr. This is intuitively explained by the fact that, when more importance is given to the losses due to a detection delay, each cardholder threshold shifts downwards; then, since the score process is independent of $c_1$ (as we can see from the AUC, which is the same across the two values of $c_1$), more transactions are predicted as fraudulent. This leads to an increment of the numerators of the FPR and the TPR, while their denominators, corresponding to the actually legitimate and actually fraudulent transactions, remain unchanged. Lower values of the cardholder thresholds also imply that, given that a purchase has been labeled as fraudulent (resp. legitimate), there is a higher chance that it is legitimate, causing a lower Pr (resp. higher NPV).

Figure 4: Visualization of the metrics of Table 3. For each metric, the corresponding values across the three information schemes are shown: blue bars are for ED, green bars are for ED + TrxAm and yellow bars are for ED + TrxAm + Coo. The numbering 1, ..., 6 refers to the models considered in Table 3: 1 and 2 are for the linear model when $c_1$ is 0.1 and 0.2, respectively; 3 and 4 are for the expected miss model when $c_2$ is 10 and 50, respectively; 5 and 6 are for the exponential model with $\alpha = 1.3367 \times 10^{-4}$ when $c_3$ is 1,000 and 2,000, respectively.
The third and fourth blocks of Table 3 show the metrics for the expected miss model (3.4) with $c_2 = 10$ and $c_2 = 50$, respectively. The fifth and sixth blocks of Table 3 contain the results for the exponential model (3.5) with $c_3 = 1{,}000$ and $c_3 = 2{,}000$, respectively, when $\alpha = 1.3367 \times 10^{-4}$. Considerations analogous to those made for the linear model apply to these cases as well.

Comparative performance analysis on simulated datasets
Models associated with the data mining techniques discussed in Section 2 can be used as benchmarks for the results of Table 3. These models were trained for each of the three information schemes on the initial dataset of Section 4.1 by using appropriate built-in Matlab functions, with the exception of BART, for which an R package was used. For the logistic regression we used the glmfit function, specifying the binomial distribution for the response variable; for the rule-based methods we constructed decision trees based on the CART algorithm (see, e.g., Han et al., 2012) via the fitctree function; for boosting, in order to mitigate the problem of imbalanced data, we applied the RUSBoost (random undersampling boosting) algorithm (Seiffert et al., 2008) by means of the fitcensemble function; for BART we used the pbart function (from the BART R package), setting the number of posterior draws for each transaction to 500; for random forests we employed Breiman's algorithm (Breiman, 2001) via the TreeBagger function and adopted 50 classification trees; for the hidden Markov model we treated the dichotomous variable isTrxFraud as the hidden state and the elapsed times, amounts and geographical coordinates as the observable outcomes, and we recovered the maximum likelihood estimates of the transition and outcome probabilities through the function hmmestimate; for support vector machines the ISD (iterative single data) algorithm (Kecman et al., 2005) was used together with a Gaussian kernel for data separation via the fitcsvm function; for neural networks we trained a feedforward network (a type of neural network with no cycles among neurons, in which information moves forward from the input neurons, through the hidden layers, up to the output neurons) with one hidden layer consisting of 10 neurons by means of the patternnet function.
Let us recall that the classification methodologies just cited may underperform when the training data are skewed, as in our case, where fraudulent transactions account for 2.23% of the data. To overcome this problem and obtain more meaningful results, the data were first balanced by drawing from the initial dataset random sub-samples characterized by a fraud ratio of 10%; apart from boosting (where the RUSBoost algorithm balances the data), the algorithms were subsequently calibrated on these sub-samples.

Figure 5: Visualization of the metrics of Table 4. For each metric, the corresponding values across the three information schemes are shown: blue bars are for ED, green bars are for ED + TrxAm and yellow bars are for ED + TrxAm + Coo. LR, DT, Bo, BA, RF, HM, SV and NN stand for logistic regression, decision trees, boosting, BART, random forests, hidden Markov model, support vector machines and neural networks, respectively.

Table 4 shows the metrics, across the three information schemes, of the previous classification models applied to the 20 simulated datasets described at the beginning of Section 4.3. The sign "-" refers to the metrics that, according to their definition in the Supplementary Material, were not computable. The content of Table 4 can be visualized in Figure 5. We may notice that random forests, the hidden Markov model and support vector machines keep the FPR low, while logistic regression, decision trees, boosting, BART and neural networks have good results when the TPR is considered. Overall, when the NPV, Pr, MCC and AUC are also taken into account, the best results are given by BART and the neural networks. When Tables 3 and 4 are compared, we observe that the metrics of our proposed models are more satisfactory than those of these classification methods.
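The balancing step mentioned above (sub-samples with a fraud ratio of 10%) can be sketched as follows. The text does not specify how the sub-samples were drawn, so the sketch makes an explicit assumption: all fraudulent transactions are kept and legitimate ones are randomly undersampled. The function name and the toy label vector are hypothetical.

```python
import random


def undersample_to_fraud_ratio(labels, target_ratio, rng):
    """Return sorted indices of a sub-sample with the requested fraud ratio.

    Assumption: every fraudulent transaction (label 1) is kept and legitimate
    transactions (label 0) are drawn at random without replacement.
    """
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    n_legit = round(len(fraud_idx) * (1 - target_ratio) / target_ratio)
    if n_legit > len(legit_idx):
        raise ValueError("not enough legitimate transactions for this ratio")
    return sorted(fraud_idx + rng.sample(legit_idx, n_legit))


# Toy dataset with a 2.23% fraud ratio, as in the training data.
labels = [1] * 223 + [0] * 9777
subsample = undersample_to_fraud_ratio(labels, target_ratio=0.10, rng=random.Random(1))
fraud_share = sum(labels[i] for i in subsample) / len(subsample)
print(len(subsample), round(fraud_share, 3))
```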

Robustness
A calibrated model can also be assessed on its robustness in correctly identifying "noisy" transactions. To this aim, we considered three perturbed scenarios where transactions were simulated by increasing the values of the cardholder specific attributes: the intensity $\hat{\lambda}_0$, the mean $\hat{\mu}_0$ and standard deviation $\hat{\sigma}_0$ of the logarithmic amounts, and the elements of the covariance matrices of the mixture of Gaussian distributions relative to the transaction coordinates. We also considered the case where the underlying distributional assumptions on the observed quantities are modified. Our conclusion is that our model performs sufficiently well even with very noisy transactions; moreover, among the other methodologies of Table 4, BART and neural networks show the best performance under stressed situations as well, even though their results are not as good as those of our models. We refer to the Supplementary Material for a thorough analysis.

Factors affecting the metrics on data simulation
At the beginning of Section 4.3 we discussed how transactions have been simulated. Both for the cardholders and the fraudster, these transactions share the following features with the real training dataset: (i) the average number of daily purchases; (ii) the mean and the variance of the logarithmic expenditures; (iii) the mean vector and the covariance matrix of the geographical coordinates of the merchants' stores. However, it is important to underline that in the simulated data: (iv) the fraud ratio is about 10%, while in the training dataset it is about 2.23%; (v) no legitimate transactions occur once a cardholder is hit by fraud, while in the training dataset we observe cases where there are regular purchases between two fraudulent transactions.
In order to understand the possible bias induced by the last two factors and to assess their impact on the final metrics, the simulation was repeated. Twenty new datasets with the same size as the training dataset were generated under three settings. Setting 1: the fraud ratio is reduced to 2.23%; setting 2: when a cardholder is hit by fraud, legitimate purchases may occur after fraudulent transactions; setting 3: settings 1 and 2 are combined. Our trained models (3.3)-(3.5) were subsequently used to score and classify the transactions in each of these settings when all the attributes are observed. Then, by means of ANOVA, the results were compared with those of Table 3, which we refer to as setting 0. The analysis is summarized in Figure 6.

Figure 6: FPR, TPR, NPV, Pr and AUC across the four different settings for the trained model (3.3) with $c_1 = 0.1$. For a given metric, the corresponding plot reports on the x-axis its range and on the y-axis the four settings. For each setting a circle is placed on the metric average, whose confidence interval is represented by a horizontal line passing through the center of the circle; the averages of two settings are significantly different if their intervals are disjoint. The circle associated with setting 0 (third line of the first section in Table 3) is blue, while the other circles are red.

Figure 6 shows the metrics FPR, TPR, NPV, Pr and AUC across the different settings for the trained model (3.3) with $c_1 = 0.1$. For every metric, the hypothesis that its mean is equal across the settings is rejected. We also see that the FPR increases from $8 \times 10^{-5}$ to about 0.06 when we move from setting 0 to setting 2; this can be explained by the fact that the score process $\Phi^{\alpha}$ does not immediately fall below the cardholder's optimal threshold when legitimate transactions occur after fraudulent transactions, so that the former are misclassified.
The TPR decreases from 0.857 to about 0.64 when we move from setting 0 to settings 1 and 3; this is due to the fact that a small number of fraudulent transactions keeps $\Phi^{\alpha}$ lower, so that their identification is more difficult. The NPV increases from 0.98 to 0.99 when we move from setting 0 to settings 1 and 3, because a lower number of fraudulent transactions makes it more likely that a purchase identified as legitimate is actually so. The Pr decreases from 0.99 to 0.59 and 0.41 when we move from setting 0 to settings 2 and 3, respectively, because, once $\Phi^{\alpha}$ exceeds the optimal threshold, it may take a while before it falls below the threshold again when legitimate purchases occur after fraudulent transactions. The AUC gets slightly worse when moving away from setting 0, because the correct identification of transactions becomes more difficult, as explained for the previous metrics. Analogous results hold true for the other models (3.4)-(3.5).
Generally speaking, we can state that, since $\Phi^{\alpha}$ is a Markov process and its current value therefore depends on its previous values, a different fraud ratio and/or a different mix of fraudulent and legitimate transactions has a non-negligible impact on the final metrics. This is not the case for the other methodologies of Table 4, where transactions are treated as independent, in the sense that their temporal order is irrelevant. Indeed, Figures 7 and 8 show that the FPR, TPR and AUC of boosting and neural networks remain fairly stable across the four different settings, while the NPV and Pr increase and decrease, respectively. Similar considerations hold true for the other data mining techniques analyzed.

Figure 7: FPR, TPR, NPV, Pr and AUC across the four different settings for boosting. The circle associated with setting 0 (third line of the "Boosting" section in Table 4) is blue, while the other circles are red.

Figure 8: FPR, TPR, NPV, Pr and AUC across the four different settings for neural networks. The circle associated with setting 0 (third line of the neural networks section in Table 4) is blue, while the other circles are red.
Additional simulations were performed by fixing lower fraud ratios (from 1.5% down to 0.05%) in setting 3. The results confirm the previous tendency: in our model, the FPR falls below 1% and the TPR decreases to 0.54; the NPV rises above 0.99, while the Pr falls to 0.31; the AUC decreases to 0.93. For boosting and neural networks (with similar conclusions for the other techniques), the FPR, TPR and AUC remain almost unchanged, as in Figures 7-8; the NPV rises above 0.99, while the Pr decreases to 0.002.
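The behaviour of the Pr and NPV for classifiers whose FPR and TPR do not change with the fraud ratio follows from Bayes' rule, as the sketch below illustrates. The function names and the rates TPR = 0.8 and FPR = 0.2 are hypothetical values chosen for illustration and are not taken from Table 4.

```python
def precision_from_rates(tpr, fpr, fraud_ratio):
    """Pr = P(fraud | flagged), obtained from TPR, FPR and the fraud ratio."""
    flagged_fraud = tpr * fraud_ratio
    flagged_legit = fpr * (1.0 - fraud_ratio)
    return flagged_fraud / (flagged_fraud + flagged_legit)


def npv_from_rates(tpr, fpr, fraud_ratio):
    """NPV = P(legitimate | not flagged), obtained from the same quantities."""
    cleared_legit = (1.0 - fpr) * (1.0 - fraud_ratio)
    cleared_fraud = (1.0 - tpr) * fraud_ratio
    return cleared_legit / (cleared_legit + cleared_fraud)


# Hypothetical classifier with fixed rates, evaluated at decreasing fraud ratios.
for ratio in (0.10, 0.0223, 0.0005):
    pr = precision_from_rates(0.8, 0.2, ratio)
    npv = npv_from_rates(0.8, 0.2, ratio)
    print(f"fraud ratio {ratio:.4f}: Pr = {pr:.4f}, NPV = {npv:.4f}")
```

With fixed rates, lowering the fraud ratio mechanically lowers the precision and raises the NPV, which is the pattern reported above for the benchmark classifiers.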

Models testing on real transactions
We also tested our calibrated models on real credit card transactions, namely those that occurred between the 1st and the 7th of December 2016. Since not all the 4,077 cardholders of the initial dataset (which covers June-November 2016, see Section 4.1) made purchases during this period, our testing dataset refers to 1,441 of them and contains 4,237 transactions, of which 150 are fraudulent. Similarly to the analysis of Table 3, we studied the performance of our models under the three information schemes of Section 4.2; as benchmarks we used the classification models of Table 4. Table 5 and Figure 9 report the obtained results. For example, for the logistic regression we see that, when only the elapsed time between two consecutive transactions is observed, all the legitimate transactions are labeled correctly, but none of the fraudulent transactions is detected (the TPR is zero); however, when the amounts and the geographical coordinates are also considered, the FPR rises to about 5.7%, but the TPR increases to 32%. Similar considerations hold true for the other classification models, for which the TPR usually increases as more information becomes available. If we concentrate on the most complete information scheme, we see that random forests, support vector machines and BART are the most conservative in terms of the FPR (1.7%, 2.5% and 3%, respectively), while logistic regression, BART and neural networks are more prone to detect fraud, as their TPRs (higher than 30%) suggest.

Table 5: [...] linear model (3.3) with $c_1 = 0.1$, across different information schemes. For the third information scheme, the bold font is used to highlight the best value of each metric across the different models.
Let us specify that the thresholds with which the fraud probabilities were compared were fixed to 0.1 for the logistic regression, 0.07 for BART, 0.08 for the hidden Markov model and 0.2 for the neural network (several threshold values were tried, and those just reported seemed to return the best results). The linear model (3.3) with $c_1 = 0.1$ shows FPR values similar to those of the majority of the benchmark models (about 4% and 6.5%, depending on the information scheme considered), but is also characterized by a much higher TPR (from about 32% to 84%), which denotes its good ability to detect fraud. The good performance of our model is also confirmed by the value of the AUC, which for each information scheme far exceeds those of the benchmarks. Moreover, our model is also fast: it takes 0.0014 seconds on average to score and classify a transaction. Similar conclusions can be drawn when the other models of Table 3 are considered.

Conclusions
In this work we addressed the problem of fraud detection in credit card transactions. Our main contributions are: (i) the application of a new detection methodology based on a Bayesian formulation of an optimal stopping problem, where the trade-off between an early false detection and a late fraud discovery is taken into account and where the cardholders' expenditure processes are assumed to evolve according to a univariate or multivariate compound Poisson process. The Bayesian character of the problem rests on the prior exponential distribution of the fraud time and on the use of the posterior probability process $\Pi$ in (3.6) (or, equivalently, the generalized odds process $\Phi^{\alpha}$) as a sufficient statistic for the optimal detection strategy (3.11)-(3.12); (ii) the computation of cardholder specific optimal thresholds with which posterior probabilities are compared to discriminate between legitimate and fraudulent transactions. This is a direct consequence of the optimal stopping approach employed and allowed us to overcome the hurdle of how to determine a decision threshold. The latter represents one of the main critical issues in fraud detection problems, usually addressed in the available literature by fixing an exogenous, cardholder independent and thus not personalized threshold.
The proposed models were calibrated on a set of real credit card transactions, under different information schemes involving part or all of the transaction attributes at our disposal: elapsed days, amounts and geographical coordinates. Then, the models were applied to score simulated and real transactions and the results were compared with those of other data mining approaches. Our findings are the following: (iii) on simulated data with a high fraud ratio of 10% and all the fraudulent transactions occurring after the legitimate ones, our models outperform the existing methodologies; (iv) under noisy simulated scenarios of legitimate cardholders' behavior, our method is robust enough to perform better than the existing methodologies; (v) when data are simulated by weakening the conditions of point (iii), the metrics returned by our models suffer a statistically significant worsening. This is due to their Markovian nature, for which the temporal order of the transactions has an important impact; this limitation does not affect the other classification techniques, which treat data as independent; (vi) when real data are used for testing, the FPR returned by our models is similar to that returned by the other methods, while the TPR, Pr and AUC metrics beat the benchmarks.
Let us observe that, unlike other methodologies used in fraud detection, our approach is not a black box, since the target functions (3.3)-(3.5) are clearly stated and can be computationally determined. The proposed models are also flexible, in the sense that new attributes of a transaction can be incorporated, and very general, because they could be applied to other frameworks, such as intrusion detection in government or private network systems. Our models must be calibrated for each cardholder; this is an advantage in that decisions are personalized, but, at the same time, it presents three drawbacks: (a) adequate computational power is required to speed up the training phase; (b) the payment history of a cardholder needs to be sufficiently long for a meaningful estimate of her behavior and, accordingly, (c) transactions of new cardholders not present in the training dataset cannot be scored. We therefore believe that future research in the area of fraud detection could be devoted to the development of hybrid models that mix the existing data mining techniques with our proposal, in order to fully exploit their potential. For example, a "two-step" procedure could be adopted: in the first step a standard technique is used; in the second step the proposed method could ease the identification of the fraudulent transactions among those to which a high suspicion score was previously assigned.