## Abstract

Matching problems are discussed in many elementary probability books, such as Feller (1968). In one version of the problem, as described by Hodges and Lehmann (1964), the photographs of $n$ film stars are paired randomly with $n$ photographs of the same stars taken when they were babies, and the distribution of the number of correct matches is derived. In this paper we shall study the same problem when the photographs are paired on the basis of various measurements that are made on them, rather than randomly. For example, suppose that $r$ different facial measurements are made on the photograph of each star and that $s$ facial measurements are made on each baby photograph. By comparing these measurements, it will typically be possible to devise a method for pairing the photographs that will yield a larger number of correct matches than would be obtained from random pairing. In fact, the procedures that will be developed in this paper can be regarded as formalizations of the heuristic procedures that a person follows when he pairs the photographs on the basis of perceived resemblances. In other versions of the same problem, dental records of parents are to be matched with dental records of their children, or measurements made on the chest X-rays of $n$ individuals are to be matched with other medical records of these same individuals. The problems described here are related in principle to problems of document linkage that have been treated in the statistical literature [see, e.g., DuBois (1969) and the references given there] but the models and methods that are used here seem to be new and unrelated to the models and methods that have previously been used in such problems. For any positive integer $k$, we shall let $R^k$ denote the space of all $k$-dimensional vectors $z = (z_1, \cdots, z_k)$, where $- \infty < z_i < \infty$ for $i = 1, \cdots, k$. 
Now let $T$ denote an $r$-dimensional random vector $(r \geqq 1)$, let $U$ denote an $s$-dimensional random vector $(s \geqq 1)$, and suppose that $T$ and $U$ have some specified joint distribution over the space $R^{r + s}$. We shall assume that a random sample of $n$ vectors $(t_1, u_1), \cdots, (t_n, u_n)$ has been drawn from this joint distribution. It is assumed, however, that before the values in this sample can be observed, each vector $(t_i, u_i)$ in the sample is broken into two separate vectors, namely the vector $t_i$ with $r$ components and the vector $u_i$ with $s$ components. The vectors $t_1, \cdots, t_n$ are then observed in some random order, say $\nu_1, \cdots, \nu_n$ and the vectors $u_1, \cdots, u_n$ are observed in some independent random order, say $w_1, \cdots, w_n$. As a result of this randomization, it is not known how the vectors $\nu_1, \cdots, \nu_n$ and the vectors $w_1, \cdots, w_n$ were paired in the original sample. It is assumed that a priori (i.e., before the specific values of $\nu_1, \cdots, \nu_n$ and $w_1, \cdots, w_n$ are observed) all $n!$ ways of pairing $\nu_1, \cdots, \nu_n$ with $w_1, \cdots, w_n$ are equally likely to reproduce the original sample. The observed vectors $\nu_1, \cdots, \nu_n$ and $w_1, \cdots, w_n$ will be called the values of a broken random sample from the specified joint distribution of $T$ and $U$. The general problem to be considered here is that of pairing the observed vectors $\nu_1, \cdots, \nu_n$ with the observed vectors $w_1, \cdots, w_n$ in order to reproduce as many of the vectors $(t_i, u_i)$ from the original sample as possible. The application of this model to the problem of matching the photographs of $n$ individuals with their baby photographs or matching two sets of medical records of $n$ individuals should be clear. 
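The sampling scheme just described can be simulated directly. The following is a minimal sketch, assuming $r = s = 1$ and a standard bivariate normal distribution with correlation `rho`; that choice of distribution, the function name `broken_random_sample`, and the returned bookkeeping lists are illustrative assumptions and do not come from the paper.

```python
import random


def broken_random_sample(n, rho, seed=0):
    """Simulate a broken random sample (illustrative assumption: r = s = 1,
    standard bivariate normal with correlation rho).

    Returns the original pairs, the shuffled values v and w, and the hidden
    orders: order_v[k] is the original index of v[k], likewise for order_w.
    """
    rng = random.Random(seed)

    # Draw the original, unbroken sample (t_1, u_1), ..., (t_n, u_n).
    pairs = []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        u = rho * t + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((t, u))

    # Break each pair and observe the two halves in independent random orders.
    order_v = list(range(n))
    order_w = list(range(n))
    rng.shuffle(order_v)
    rng.shuffle(order_w)
    v = [pairs[i][0] for i in order_v]
    w = [pairs[i][1] for i in order_w]
    return pairs, v, w, order_v, order_w


pairs, v, w, order_v, order_w = broken_random_sample(5, rho=0.8, seed=1)
# A statistician would observe only v and w; order_v and order_w stay hidden.
```

Because the two shuffles are independent, each of the $n!$ ways of pairing the observed values is equally likely a priori to reproduce the original sample, which is the assumption stated above.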
The important assumption that we have made here is that the observations for the $n$ given individuals can be regarded as the values in a random sample of size $n$ from some larger population of individuals for which the probability distribution is known. In this paper we shall assume that the joint distribution of $T$ and $U$ can be represented by a joint pdf $f$ of the following form: \begin{equation*}\tag{1.1} f(t, u) = \alpha(t)\beta(u) e^{\gamma(t)\delta(u)}, \quad t \in R^r, u \in R^s,\end{equation*} where $\alpha, \beta, \gamma$, and $\delta$ are arbitrary real-valued functions of the indicated vectors. If either $r = 1$ or $s = 1$, and if the joint distribution of $T$ and $U$ is a multivariate normal distribution, then their joint pdf will be of the form (1.1). A multivariate normal distribution of this type is undoubtedly the most important special case of (1.1). In particular, if both $t$ and $u$ are one-dimensional, the pdf of every bivariate normal distribution is of the form (1.1). Another example of a bivariate pdf of the form (1.1) is \begin{align*} f(t, u) &= t e^{-t(1+u)}, \quad t > 0, u > 0, \\ &= 0 \quad \text{otherwise.}\end{align*} We shall now present a summary of the specific problems that will be considered in this paper and some of the results that will be obtained. In Section 2, the problem of pairing the vectors $\nu_1, \cdots, \nu_n$ with the vectors $w_1, \cdots, w_n$ in order to maximize the probability of a completely correct set of $n$ matches is considered. It is shown that the probability is maximized if the values of $\gamma(\nu_1), \cdots, \gamma(\nu_n)$ are ordered from smallest to largest, the values of $\delta(w_1), \cdots, \delta(w_n)$ are similarly ordered, and corresponding terms in these two orderings are paired with each other. This solution is also the maximum likelihood solution for the problem of pairing $\nu_1, \cdots, \nu_n$ with $w_1, \cdots, w_n$. 
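The sorting rule stated for Section 2 can be sketched in a few lines. This is a minimal illustration rather than the authors' own procedure: the function name `ml_pairing` and the sample values are hypothetical. For the exponential-type example pdf above, one valid decomposition of the form (1.1) is $\alpha(t) = t e^{-t}$, $\beta(u) = 1$, $\gamma(t) = -t$, $\delta(u) = u$, and that is what the usage lines assume.

```python
def ml_pairing(v, w, gamma, delta):
    """Pair observations by matching the ranks of gamma(v_i) and delta(w_j).

    Returns a list of (index into v, index into w) tuples: the smallest
    gamma value is paired with the smallest delta value, and so on.
    Ties are ignored here; they have probability zero for continuous data.
    """
    v_order = sorted(range(len(v)), key=lambda i: gamma(v[i]))
    w_order = sorted(range(len(w)), key=lambda j: delta(w[j]))
    return list(zip(v_order, w_order))


# Hypothetical observed values for f(t, u) = t * exp(-t * (1 + u)),
# using the decomposition gamma(t) = -t, delta(u) = u.
v = [0.5, 2.0, 1.0]
w = [0.3, 3.0, 1.2]
pairing = ml_pairing(v, w, gamma=lambda t: -t, delta=lambda u: u)
print(pairing)  # -> [(1, 0), (2, 2), (0, 1)]
```

In this example the largest $t$ value is paired with the smallest $u$ value, which agrees with the negative dependence between $t$ and $u$ in that pdf.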
In Section 3 this maximum likelihood solution is applied to the multivariate normal distribution and is shown to have a natural and intuitive interpretation in terms of regression. In Section 4, we consider the problem of choosing a vector $w_j$ from the set $w_1, \cdots, w_n$ in order to maximize the probability of correctly matching one specified vector $\nu_i$ from the set $\nu_1, \cdots, \nu_n$. It is shown that if $\gamma(\nu_i)$ is the minimum or the maximum of the $n$ values $\gamma(\nu_1), \cdots, \gamma(\nu_n)$, then $\nu_i$ should be paired with a vector $w_j$ for which $\delta(w_j)$ is a minimum or a maximum, respectively. For intermediate values of $\gamma(\nu_i)$, the solution is shown to be more complicated. In Section 5, the problem of pairing $\nu_1, \cdots, \nu_n$ with the vectors $w_1, \cdots, w_n$ in order to maximize the expected number of correct matches is considered. Although the general solution of this problem is complicated, it is shown here again that the vector $\nu_i$ for which $\gamma(\nu_i)$ is a minimum should always be paired with the vector $w_j$ for which $\delta(w_j)$ is a minimum, and the vector $\nu_i$ for which $\gamma(\nu_i)$ is a maximum should always be paired with the vector $w_j$ for which $\delta(w_j)$ is a maximum. In particular, it follows that when $n = 3$, the solution to this problem and the maximum likelihood solution are always identical. In Section 6, sufficient conditions are given under which, for an arbitrary value of $n$, the maximum likelihood solution will also maximize the expected number of correct matches. 
The simplest and most striking sufficient condition given there, but also the most severe condition, is that $\lbrack \max_i \gamma(\nu_i) - \min_i\gamma(\nu_i) \rbrack \lbrack \max_j \delta(w_j) - \min_j \delta(w_j) \rbrack \leqq 1$. Finally, in Section 7, some examples are given in which these sufficient conditions are not satisfied and the maximum likelihood solution does not maximize the expected number of correct matches.
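For small $n$, the two criteria can be compared by brute force. Under (1.1) with all $n!$ pairings equally likely a priori, the posterior probability of a pairing $\phi$ is proportional to $\exp\{\sum_i \gamma(\nu_i)\delta(w_{\phi(i)})\}$, because the factors $\alpha$ and $\beta$ are common to every pairing. The sketch below enumerates all pairings under that observation. The function names and the numerical values are hypothetical, and the enumeration is only an illustration, not the method developed in the paper.

```python
import itertools
import math


def posterior_over_pairings(x, y):
    """x[i] = gamma(v_i), y[j] = delta(w_j).

    Returns a dict mapping each permutation p (v_i paired with w_{p[i]})
    to its posterior probability under a uniform prior on pairings.
    """
    n = len(x)
    weights = {
        p: math.exp(sum(x[i] * y[p[i]] for i in range(n)))
        for p in itertools.permutations(range(n))
    }
    total = sum(weights.values())
    return {p: wt / total for p, wt in weights.items()}


def expected_correct_matches(psi, posterior):
    """Posterior expected number of correct matches when pairing psi is used."""
    return sum(
        prob * sum(1 for a, b in zip(psi, phi) if a == b)
        for phi, prob in posterior.items()
    )


def range_condition_holds(x, y):
    """Check the simplest sufficient condition quoted above."""
    return (max(x) - min(x)) * (max(y) - min(y)) <= 1.0


# Hypothetical values of gamma(v_i) and delta(w_j), both already sorted,
# so the maximum likelihood pairing is the identity permutation.
x = [0.0, 0.4, 0.9]
y = [0.1, 0.5, 1.0]
posterior = posterior_over_pairings(x, y)
ml = max(posterior, key=posterior.get)
best = max(posterior, key=lambda psi: expected_correct_matches(psi, posterior))
print(range_condition_holds(x, y), ml, best)
```

Here the range condition holds and both criteria select the identity pairing, which is what the stated results predict. The enumeration costs $n!$ evaluations, so it is only practical for very small $n$.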

## Citation

Morris H. DeGroot. Paul I. Feder. Prem K. Goel. "Matchmaking." Ann. Math. Statist. 42 (2) 578 - 593, April, 1971. https://doi.org/10.1214/aoms/1177693408

## Information