Finite space Kantorovich problem with an MCMC of table moves

In Optimal Transport (OT) on a finite metric space, one defines a distance on the probability simplex that extends the distance on the ground space. The distance is the value of a Linear Programming (LP) problem on the set of non-negative 2-way tables whose margins are the assigned probability functions. We apply the methodology of moves from Algebraic Statistics (AS) to this case and use it to derive a Markov Chain Monte Carlo (MCMC) solution algorithm.

MSC2020 subject classifications: Primary 62R01, 65C05, 60K35; secondary 62H17, 62H05.


Introduction
In the present paper, we aim to show a connection between Optimal Transport (OT) and Algebraic Statistics (AS).
Modern OT was started by Kantorovich in 1939, and a new wave of development was initiated by Villani [22]. In the present paper we also use an earlier result obtained by Gini [10]. A (finite) sample space X and a cost function c : X × X → R are given. The set of joint probability functions γ on X × X with given margins μ and ν is called the set of couplings, γ ∈ P(μ, ν). In OT, one looks for an element that minimizes the expected value c(γ) = Σ_{x,y∈X} c(x, y)γ(x, y). There is a rich general theory, see, for example, the textbook by Santambrogio [18], but here we restrict our attention to the finite state space case.
AS was started by the paper Diaconis and Sturmfels [8] and by the book Pistone, Riccomagno, and Wynn [14]. In particular, the first paper deals with an algebraic method for constructing an irreducible random walk on the space of multi-way contingency tables with given margins. Each step of the random walk is associated with a move, that is, a table with zero margins, that subtracted to an initial table, produces a new table with the same margins. Basic results on contingency tables are to be found in Fienberg [9].
We extend this idea to general tables, that is, tables not restricted to be integer-valued, and apply it to OT on a finite state space. To this aim, we provide a detailed study of the geometry of moves with continuous values. This paper considers both topics in computational algebra and in computational statistics. As an application, we define an MCMC algorithm for the computation of the optimal value and the optimal coupling in the case of a discrete sample space. Many special algorithms have been developed, see a general overview in Peyré and Cuturi [13]. Our algorithm is intended to be an alternative proposal.
The paper is organised as follows. In Section 2 we review the generalities and discuss the algebra of moves, considering both the linear algebra and the group algebra of moves. The Kantorovich problem is a special Linear Programming (LP) problem that we outline both as a primal and as a dual problem. In Section 3 we prove that a class of basic moves connects all couplings. The results are generalized to the tri-variate case in Section 4. Based on that theory, in Section 5 we provide an MCMC algorithm to compute solutions of the minimal cost problem.

[Figure 1 displays the couplings γ₁ = (1/6 1/3 ; 1/2 0) and γ₂ = (1/2 0 ; 1/6 1/3).]
The notion of couplings has a related setup in the study of integer-valued tables with given margins. Given a table T = [n(i, j)], i, j = 1, …, n, with entries in Z₊, the grand total is N = n(+, +) = Σ_{i,j} n(i, j) and the margins are n(·, +) and n(+, ·). The corresponding probability function is defined by γ(i, j) = n(i, j)/N, with i, j ∈ {1, …, n}. Conversely, if γ ∈ Δ(X × X) has rational values, it comes from a table. See the extensive treatments in [9] and [20].

[Caption of Figure 1: Each vertex of the left simplex is mapped to a vertex of the right polytope, δ_{ij} → δ_i ⊗ δ_j. The dashed segment from γ₁ to γ₂ represents the coupling polytope of the margins represented by the circle in the right polytope. Notice that γ₁ belongs to the facet opposite to δ₂₂, while γ₂ belongs to the facet opposite to δ₁₂.]
Let c : X × X → R₊ be a non-negative valued function to be interpreted as the cost. The cost of a coupling γ (c-cost) is

c(γ) = Σ_{x,y∈X} c(x, y)γ(x, y) .   (2.2)

We are interested in minimizing the expected cost over the polytope of couplings. The Kantorovich cost (K-cost) is

K_c(μ, ν) = inf {c(γ) | γ ∈ P(μ, ν)} .   (2.3)

Especially, when the cost is a distance d, the minimum cost defines a distance on the simplex Δ(X), the Kantorovich distance (K-distance). The distance case is considered in detail in [12].
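Since the K-cost in Equation (2.3) is the value of a finite LP, it can be computed directly with an off-the-shelf solver. The following sketch is ours, not part of the paper: it assumes `scipy` is available, and the function name `kantorovich_cost` is illustrative. The coupling table is flattened into a vector and the margin constraints are imposed as equalities.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_cost(cost, mu, nu):
    """Solve the finite Kantorovich LP: minimize <c, gamma> over P(mu, nu)."""
    n, m = cost.shape
    # Margin constraints: row sums equal mu, column sums equal nu.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row i sums to mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column j sums to nu[j]
    b_eq = np.concatenate([mu, nu])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

# Dirac margins: the K-cost recovers the ground distance.
d = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
value, gamma = kantorovich_cost(d, np.array([1., 0., 0.]), np.array([0., 0., 1.]))
```

With Dirac margins, the value recovers the ground distance, in agreement with the extension property stated in Proposition 2.3 below.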
As the simplex is a compact set, the optimal value is always obtained at some optimal coupling.
In the case of equality of the two margins μ = ν, the distance is zero because there is a coupling whose support consists of loops only, where d(x, x) = 0.
When the coupling is defined by the independence, γ = μ ⊗ μ, the Kantorovich value is a Gini index of dispersion of μ, see the monograph by Yitzhaki and Schechtman [23].
The Kantorovich problem defined above is a special LP problem, in that we want to find the minimum of a linear function subject to equality and inequality constraints. It follows immediately from the definition that there exists a face of P(μ, ν) whose elements γ̄ are optimal, that is, c(γ̄) = K_c(μ, ν) or, in the distance case, d(μ, ν) = Σ_{x,y} d(x, y)γ̄(x, y). Generically, the set of solutions will be a vertex of the coupling polytope, hence subject to the support constraints of Proposition 2.1.
Let us discuss an equivalent form of the Kantorovich problem. The marginalization operator Π maps a joint function to its pair of margins, and ker Π is the set of all functions f : X × X → R whose margins are zero. Let us show that the convex set A is, in fact, a compact convex set. For each f ∈ A and all (x, y), a bound holds in the first variable; the same argument applies to the other variable, so that f(x, y) ≥ −(μ(x) ∧ ν(y)). In turn, this allows us to give a proof of the following continuity result.
Proof. This is an application of Berge's Maximum Theorem, see, for example, [1, § 17.5]. Here is a sketch of the proof. As the function to optimize is continuous, one has to show that the mapping (μ, ν) → A(μ, ν) is both upper and lower hemicontinuous, see the definitions in [1, § 17.2]. In our case, upper hemicontinuity follows from compactness. Lower hemicontinuity is proved by considering a sequence (μ_n, ν_n) converging to (μ, ν) and noting that the elements of the sequence A(μ_n, ν_n) are convex and contained in an ε-neighborhood of A(μ, ν).
As the Kantorovich problem is an LP problem, the duality theory applies, see, for example, [3, § IV.8]. Equations (2.2) and (2.3) can be written in primal standard form as

minimize Σ_{x,y} c(x, y)γ(x, y) subject to Πγ = (μ, ν), γ ≥ 0.

The equivalent dual standard form is

maximize Σ_x φ(x)μ(x) + Σ_y ψ(y)ν(y) subject to φ(x) + ψ(y) ≤ c(x, y) for all x, y,

that is, φ ⊕ ψ ≤ c in the matrix representation.
In this paper, we restrict our attention to the primal problem. However, the dual problem is interesting in that the domain does not depend on μ, ν, but it depends on the cost c only.
Let us observe that the feasibility domain {φ ⊕ ψ ≤ c} in the dual problem can be further restricted. For a full presentation of the following argument, see [18, § 1.6]. If φ(x) + ψ(y) ≤ c(x, y), then φ₁(x) = inf_y {c(x, y) − ψ(y)} has the following properties: it is feasible against the same ψ, and φ₁ ≥ φ. The same argument applies to ψ. In conclusion, the feasible domain can be restricted, without changing the maximum, to all pairs (φ, ψ) satisfying Equation (2.6). In particular, the optimal pair satisfies all the conditions above.
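The tightening step φ ↦ φ₁ can be illustrated numerically. In the following sketch (ours; the helper name `c_transform` is an assumption, and the example data are random), the transform preserves feasibility against the same ψ and never decreases the candidate φ:

```python
import numpy as np

def c_transform(cost, psi):
    """phi_1(x) = min_y (c(x, y) - psi(y)): the tightest phi feasible against psi."""
    return (cost - psi[None, :]).min(axis=1)

rng = np.random.default_rng(0)
n = 5
cost = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
psi = rng.uniform(-1, 1, size=n)
# A feasible phi with some slack, so that phi + psi <= c holds strictly.
phi = (cost - psi[None, :]).min(axis=1) - rng.uniform(0, 1, size=n)
phi1 = c_transform(cost, psi)
feasible = (phi1[:, None] + psi[None, :] <= cost + 1e-12).all()
```

The inequality phi1 ≥ phi illustrates that the restriction does not change the dual maximum.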
When the cost c is a distance (denoted, if any confusion could arise, by d), then the Kantorovich construction induces a distance on probability functions. Moreover, it is possible to define metric geodesics and hence, a proper geometry associated to the given distance. The following proposition provides the details. The extension property is a key characteristic of the K-distance which is not shared by other statistical measures of divergence.

Proposition 2.3. Assume that the cost function in Equation (2.2) is a distance d. Then:
1. The K d value is a distance that extends the ground distance, that is, the K-distance between two Dirac probability functions equals the distance between the respective supports.
2. Given μ, ν ∈ Δ(X), the mixture curve μ(t) = (1 − t)μ + tν, 0 ≤ t ≤ 1, is a metric geodesic for the K-distance, that is, K_d(μ(s), μ(t)) = |t − s| K_d(μ, ν).

3. If γ̄ is optimal for d(μ, ν), then an optimal coupling of μ(s) and μ(t) can be defined from γ̄.

Proof. This proof is known from the quoted literature. We repeat it here for the sake of completeness. Given the existence of optimal couplings, the triangle inequality for the K-distance follows by composing two optimal couplings: the composition defines a coupling γ of μ and ν whose value is less than or equal to the sum of the two values. Notice that d must be a distance because we want to use the triangle inequality to check the last statement. The other two statements are proved together. First, one checks that γ(s, t) is indeed a coupling of μ(s) = (1 − s)μ + sν and μ(t) = (1 − t)μ + tν, and that its value is at most |t − s| K_d(μ, ν). But none of the inequalities can be strict, because otherwise the triangle inequality along the curve would be violated. This concludes the proof.
The previous proposition does not rule out the existence of multiple geodesics between two points.
We will also take advantage of the following definition from the algebraic theory of two-way contingency tables, see, for example, [15] and [2]. Remember that the affine space of the convex polytope P(μ, ν) is the vector space generated by the differences γ₁ − γ₂, γ₁, γ₂ ∈ P(μ, ν). Clearly, the margins of the elements of the affine space are null.
Notice that there are C(n,2)² different basic moves up to sign. They are not linearly independent. We prove below that, given a pivot point (u, v), the (n − 1)² basic moves of the type (δ_u − δ_x) ⊗ (δ_v − δ_y), x ≠ u, y ≠ v, form a linear basis.

Proposition 2.4. The dimension of ker Π is (n − 1)². For each u, v ∈ X, the set of basic moves (δ_u − δ_x) ⊗ (δ_v − δ_y), x ≠ u, y ≠ v, is a linear basis of ker Π.

Proof. Note first that the image of the marginalization mapping is a space of dimension 2n − 1. In fact, the two margins of a table A share the grand total, 1ᵗA1 = 1ᵗAᵗ1, and, given any pair of margins f and g such that Σ_x f(x) = Σ_y g(y), the outer product f ⊗ g (suitably normalized) is a counter-image. It follows that the dimension of the kernel is n² − (2n − 1) = (n − 1)².
Every basic move (δ_u − δ_x) ⊗ (δ_v − δ_y) is clearly an element of the kernel. Let us find a basis of M. Let M ∈ M and fix u, v ∈ X.
Equation (2.7) now follows immediately by adding over all u, v such that M(u, v) > 0.
We have shown that every move M is a linear combination of the (n − 1)² basic moves (δ_u − δ_x) ⊗ (δ_v − δ_y) with pivot (u, v). In particular, all other basic moves are combinations of these special moves; a similar decomposition holds when M is a simple move. Although the (n − 1)² pivotal moves around (u, v) form a linear basis of the vector space of moves, we will need to use all basic moves in order to perform a connected random walk that stays in the polytope P(μ, ν), see [20].
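For small n, the statement on pivotal moves can be checked directly. The sketch below (ours; names are illustrative) stacks the (n − 1)² pivotal moves around the pivot (0, 0) and verifies that they have zero margins and full rank (n − 1)²:

```python
import numpy as np

def basic_move(n, u, x, v, y):
    """The table (delta_u - delta_x) (tensor) (delta_v - delta_y)."""
    du = np.zeros(n); du[u] += 1.0; du[x] -= 1.0
    dv = np.zeros(n); dv[v] += 1.0; dv[y] -= 1.0
    return np.outer(du, dv)

n = 4
# Pivotal moves around (u, v) = (0, 0): x != 0, y != 0.
pivotal = [basic_move(n, 0, x, 0, y) for x in range(1, n) for y in range(1, n)]
stack = np.array([M.ravel() for M in pivotal])
rank = np.linalg.matrix_rank(stack)
zero_margins = all(np.allclose(M.sum(axis=0), 0) and np.allclose(M.sum(axis=1), 0)
                   for M in pivotal)
```

The rank equals (n − 1)² = 9 here, confirming linear independence for this small case.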

Proposition 2.5. The move M is the difference of two couplings γ, γ′ ∈ P(μ, ν) if, and only if, both a ≤ μ and b ≤ ν hold, where a and b denote the common margins of the positive part M⁺ and the negative part M⁻.
Conversely, assume M is a move, decomposed into its positive and negative parts, M = M⁺ − M⁻, with common margins a and b. Notice that Σ_x a(x) = Σ_y b(y) = h, so that there exists a non-negative M* : X × X → R whose margins are (μ − a) and (ν − b), respectively, and whose grand total is 1 − h.
The equations γ = M⁺ + M* and γ′ = M⁻ + M* provide the required couplings.

Proposition 2.6. Every move M is of the form M = Σ_{i=1}^k α_i F_i,
where α₁, …, α_k > 0 and F₁, …, F_k are simple moves. Moreover, it is possible to choose the simple moves in such a way that the supports of the successive remainders decrease.

Proof. Let M be a move and consider the two sets of indices given by the supports of M⁺ and M⁻. Without restriction of generality, assume that the first projection of M⁺ has n points. Let us define a directed bipartite graph on these vertices; edges of the first type are horizontal in the table, while edges of the second type are vertical. At least one edge of the first type always exists for each x because the sum over that row is null, and the same holds for each column y. By construction, there are at least 2n edges in the graph and at most 2n vertices. Hence, there is at least one irreducible cycle of even length, say 2m. Fix a starting point in M⁺ and enumerate the vertices along the cycle. Let us construct a simple move from the cycle above, with the indices computed mod m: the first expression shows that the first margin is zero, while the second expression shows that the second margin is zero.

We are interested in the characterisation of moves which are the difference of two couplings, where the first one is fixed.
The couplings γ and γ′ are related to each other through M and α. In particular, the cost of γ′ depends on α, on the cost of γ, and on the cost of M. We are especially interested in the case where M is a simple move; in that case, the property can be restated in a more specific form.

Proposition 2.7. Let M be a simple move and let (x_j, y_j), j = 1, …, k, be an enumeration of {M = −1}. Then there exists a permutation σ ∈ S_k such that {M = +1} = {(x_j, y_{σ(j)}) | j = 1, …, k}.
Proof. Clearly, the two sets {M = +1} and {M = −1} have the same number of points. Let (x_j, y_j), j = 1, …, k, be an arbitrary enumeration of the second one. As the first margin of M is zero, it follows that the first components of {M = +1} are x_{σ′(j)} for some permutation σ′ ∈ S_k. Considering the second margin, we find that the second components are y_{σ″(j)} for some permutation σ″ ∈ S_k. Now the required identity follows by taking σ = σ″(σ′)⁻¹.
From Equation (2.8), it follows that the c-cost of a simple move M can be written as in Equation (2.9). The condition in Equation (2.9) appears in the literature under the name given in the following definition: a set G ⊂ X × X is c-cyclically monotone if, for every finite sequence (x₁, y₁), …, (x_k, y_k) ∈ G,

Σ_{i=1}^k c(x_i, y_i) ≤ Σ_{i=1}^k c(x_i, y_{i+1}) ,   (2.10)

with the indices computed mod k. The name is due to Rockafellar [16, § 24], who considered a similar property as a condition for a multi-mapping to be the subdifferential of a convex function. Cyclical monotonicity of Supp(γ) for the cost c is a known necessary and sufficient condition for the optimality of γ in the corresponding Kantorovich problem. It is the so-called Fundamental Theorem of Optimal Transport, see, for example, [18, § 1.6]. Here, we want to discuss the same topic in the algebraic language of moves by using the following simple equivalence.
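Condition (2.10) can be checked by brute force on a small support. The helper below is our illustrative sketch: it tests every cyclic sequence drawn from the support. For the cost |x − y|, a co-monotone (diagonal) support passes, while an anti-diagonal pair fails:

```python
from itertools import permutations

def is_cyclically_monotone(support, cost):
    """Brute-force check of Equation (2.10) over all cyclic sequences in `support`."""
    pts = list(support)
    for k in range(2, len(pts) + 1):
        for seq in permutations(pts, k):
            lhs = sum(cost(x, y) for x, y in seq)
            rhs = sum(cost(seq[i][0], seq[(i + 1) % k][1]) for i in range(k))
            if lhs > rhs + 1e-12:
                return False
    return True

cost = lambda x, y: abs(x - y)
mono = is_cyclically_monotone({(1, 1), (2, 2), (3, 3)}, cost)   # co-monotone support
anti = is_cyclically_monotone({(1, 3), (3, 1)}, cost)           # crossing pair
```

For the crossing pair, swapping the second coordinates strictly reduces the total cost, so the support cannot be optimal.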
Proof. Assume there exists a sequence in G such that (2.10) does not hold. This is equivalent to saying that the corresponding move has a positive value and support contained in G.
We restate the Fundamental Theorem as follows. The proof is to be found, for example, in [18, § 1.6]. We will provide a different proof in the next section. Now we briefly discuss the algebraic properties of simple moves, see [19]. Proposition 2.7 shows that, given a set of points and a permutation σ, one obtains a simple move, and, conversely, every simple move is of this type. Notice that the representation is not unique, because if σ(i) = i, then the two corresponding terms cancel.
Let us consider first the effect of the composition of two permutations. Every permutation is a product of circular permutations. Consider, for example, the case σ = π₁π₂, where π₁ and π₂ are circular permutations with supports I₁ and I₂, respectively, and choose a coding such that I₁ = {1, …, h}. That is, every simple move is the sum of simple moves associated with circular permutations on disjoint supports. In turn, this shows that the support of a simple move is a union of cycles.
The last case to consider is that of a permutation given as a product of exchanges. If π = (i ↔ j) and G = {(x₁, y₁), (x₂, y₂)}, then the simple move is, up to sign, (δ_{x₁} − δ_{x₂}) ⊗ (δ_{y₁} − δ_{y₂}), which is, in fact, a basic move. Indeed, every simple move is a sum of basic moves. This representation is different from the one obtained by considering a linear basis, because the representing basic moves depend on the original simple move; they are not restricted to be elements of a basis.
We conclude this section by highlighting that optimality is related to the existence of cycles in the support of the coupling, as the following proposition suggests.
Proposition 2.10. Let γ ∈ P(μ, ν) be a coupling such that Supp (γ) contains a cycle and assume that the cost is a distance, denoted by d. Then there exists a coupling γ * ∈ P (μ, ν) such that d(γ * ) ≤ d(γ) and γ * − γ is proportional to a simple move.
Assume now that Supp (γ) contains a cycle of length greater than 2. Two cases arise.
If there are two concordant consecutive arrows, the corresponding move is admissible and reduces the cost by virtue of the triangle inequality. Moreover, applying this move, the original cycle is replaced by a cycle with one edge fewer.
Finally, if all consecutive edges of Supp(γ) are discordant, then an integer move (not necessarily basic) can be applied with both positive and negative sign: the relevant move alternates +1 and −1 along the cycle. Choosing a sign such that the cost does not increase, and α = min{γ(1, 2), γ(3, 4), γ(5, 6)} or α = min{γ(1, 6), γ(3, 2), γ(5, 4)} depending on the sign, one edge of the circuit is deleted. Notice that all the moves used to reduce a cycle do not produce new cycles, because their supports are contained in the relevant cycle.

Couplings, homophily, and moves
Early in the 20th century, Gini [10] defined the notion of index of homophily for a sample (x_i, y_i), i = 1, …, N, of a bi-variate real random variable (X, Y). His aim was to discuss a general notion of statistical dependence by comparing the value of E(|X − Y|) with its minimum and maximum value in the class of joint probability functions with the same margins. Based on that, Gini introduced an associated statistical index that was extensively studied in the following years by himself and by others, especially by Salvemini [17] and Dall'Aglio [6]. A modern account of the Gini methods is to be found in the monograph by Yitzhaki and Schechtman [23]. Below we describe his work in the context of the subsequent developments by Kantorovich, who was inspired more by the early work of Monge on OT than by Gini's methodological ideas. Here we use Gini's method as an intermediate tool to solve more general Kantorovich problems.
Given a bi-variate real sample (x_i, y_i), i = 1, …, N, let us sort both the first and the second variables in ascending order. This operation produces a new bi-variate sample (x_(i), y_(i)), i = 1, …, N, with the same marginal sample distributions as the original one. Gini calls it the co-graduation of the original sample.
Clearly, this is a special case of the general theory of coupling, because the original discrete sample distribution and its co-graduation have the same margins.
The difference between the original sample distribution and the co-graduation is a simple move, associated with the permutations σ and σ′ of S_N that sort the two sequences.
Clearly, two finite real sequences are co-monotone if they are co-graduated, and two co-monotone sequences are turned into two co-graduated sequences by a suitable common permutation.
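Co-graduation is simply a joint sort of the two margins. A minimal sketch (ours), showing on one sample that co-graduation does not increase Σ_i |x_i − y_i|, in line with Gini's theorem below:

```python
def cograduate(xs, ys):
    """Sort both sequences in ascending order and re-pair them."""
    return list(zip(sorted(xs), sorted(ys)))

xs = [3, 1, 2, 5]
ys = [4, 6, 0, 2]
original = sum(abs(x - y) for x, y in zip(xs, ys))      # 1 + 5 + 2 + 3 = 11
cograd = sum(abs(x - y) for x, y in cograduate(xs, ys))  # pairs (1,0),(2,2),(3,4),(5,6)
```

The co-graduated pairing has the same marginal sample distributions but a strictly smaller L¹ index on this sample.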
We observe that, if a joint probability function has rational probabilities, then it can be simulated by a finite sample. The following proposition is Gini's original theorem; notice that it provides a special case of cyclical monotonicity for the distance d(x, y) = |x − y|. Given a coupling γ of (μ, ν) with rational values, represented by a bi-variate sequence (x_i, y_i), i = 1, …, N, with joint sample distribution γ and marginal distributions μ and ν, the index Σ_{i=1}^N |x_i − y_i| is minimum when the two sequences are co-monotone and is maximum when they are counter-monotone.
Proof. It is enough to consider (as Gini himself does) the co-graduated (respectively, counter-graduated) case. Consider each pair of successive indices i and i + 1. Enumeration of all possible cases of the signs of the differences shows that the minimum is actually attained at the lower bound and occurs when the two sequences are co-monotone.
Remark 3.1. From the point of view of transport theory, we have found that the coupling of maximal index is obtained through the cross-tabulation of the two co-graduated marginal distributions. In modern terms, we can say that Gini found the L¹-optimal coupling of the two marginal distributions when the frequencies are rational.
where a_i and b_j are the values of the two margins, respectively, and n(i, j) and n_co(i, j) are the counts in the original table and in H, respectively. The previous argument applies to tables of counts, that is, when the frequencies are rational numbers. More generally, the table H of the example above can be derived from the margins by the so-called North-West rule: moving left to right and top to bottom, each cell gets the maximum value compatible with the marginal constraints. See the history of the earlier results in [7]. We are going to see that the North-West rule does produce the maximal homophily coupling in the general discrete case.
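The North-West rule is straightforward to implement. The following sketch is ours (the function name `north_west` is illustrative): scanning left to right and top to bottom, each cell receives the largest value compatible with the residual margins.

```python
def north_west(mu, nu):
    """H-coupling of discrete margins mu, nu via the North-West rule."""
    n, m = len(mu), len(nu)
    row = list(mu)    # residual row margins
    col = list(nu)    # residual column margins
    gamma = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            a = min(row[i], col[j])   # largest value compatible with the constraints
            gamma[i][j] = a
            row[i] -= a
            col[j] -= a
    return gamma

gamma = north_west([0.5, 0.5], [0.25, 0.75])
```

On these margins the rule fills (1,1) with 0.25, exhausts the first column, and completes the first row before moving down.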
In the following, without restriction of generality, consider the case where both the values of x and y are {1, . . . , n}. In this way we have a natural total order on the sample space.

Proof. For each pair of indices (i, j), consider (h, j), h > i, and (i, k), k > j.
Let us show that n(h, j) and n(i, k) cannot both be positive. In fact, assume there exist t₁ and t₂ such that x_{t₁} = h, y_{t₁} = j, x_{t₂} = i, y_{t₂} = k. Necessarily, t₁ ≠ t₂. As x is non-decreasing and x_{t₁} > x_{t₂}, it holds t₁ > t₂. As y is non-decreasing and y_{t₁} < y_{t₂}, it holds t₁ < t₂. We have obtained a contradiction, and we have shown that only one of the two counts, right and down, can be positive. More precisely, if n(i, k) > 0 for some k > j, then n(h, j) = 0 for all h > i; that is, if the rest of the row is not all zero, then the rest of the column is. The same holds exchanging rows and columns.
To conclude, rewrite Equation (3.1) accordingly and observe that at least one among Σ_{k>j} n(i, k) and Σ_{h>i} n(h, j) is zero.

Proof. First note that Equation (3.2) is well defined because the right-hand side of the equation involves pairs of indices which precede the current one (i, j).

We want γ_H to be non-negative with margins μ and ν, and Σ_{i,j} γ_H(i, j) = 1. To prove the proposition, we proceed by recursion on the rows. Consider the first element γ_H(1, 1) = min{μ(1), ν(1)}. If μ(1) ≤ ν(1), then γ_H(1, 1) = μ(1), the rest of the first row is zero, and the first column continues with the residual margin ν(1) − μ(1); the symmetric case is analogous. As the above procedure does not depend on the normalization of the margins, we can apply the procedure iteratively.
Noticing that the H-coupling is unique in P(μ, ν), the proof of the theorem rests on the following proposition: given a coupling γ ∈ P(μ, ν), there exist a sequence of basic moves M₁, …, M_k and a sequence of positive real numbers α₁, …, α_k such that γ_H = γ − Σ_{i=1}^k α_i M_i.

Proposition 3.4.
Proof. We scan the table γ from (1, 1) to (1, n) in the first row, then from (2, 1) to (2, n) in the second row and so on.
Let us consider the probability γ(i, j). If γ does not yet satisfy the North-West property at (i, j), then there exist indices i₁ > i and j₁ > j such that γ(i, j₁) > 0 and γ(i₁, j) > 0. Thus we can apply the basic move M_{i,i₁,j,j₁} with +1 in (i, j₁) and (i₁, j), and −1 in (i, j) and (i₁, j₁). Let α_{i,i₁,j,j₁} = min{γ(i, j₁), γ(i₁, j)} and move from γ to γ − α_{i,i₁,j,j₁} M_{i,i₁,j,j₁}. Notice that for a given (i, j) only a finite number of moves can be applied, since at each step one probability in the i-th row or in the j-th column goes to zero; therefore the procedure ends in a finite number of steps.
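The scanning procedure in the proof can be sketched in code (ours; names illustrative): whenever two positive cells form a "south-east" pattern with respect to the current cell, the corresponding basic move pushes mass toward the current cell, and the loop is repeated until no such pattern remains, which is the North-West property.

```python
def reduce_to_H(gamma, tol=1e-12):
    """Apply basic moves gamma -> gamma - alpha*M until the North-West property holds."""
    n, m = len(gamma), len(gamma[0])
    g = [row[:] for row in gamma]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(m):
                for i1 in range(i + 1, n):
                    for j1 in range(j + 1, m):
                        alpha = min(g[i][j1], g[i1][j])
                        if alpha > tol:
                            # Move with +1 in (i,j1),(i1,j) and -1 in (i,j),(i1,j1);
                            # gamma - alpha*M adds alpha at (i,j) and (i1,j1).
                            g[i][j] += alpha
                            g[i1][j1] += alpha
                            g[i][j1] -= alpha
                            g[i1][j] -= alpha
                            changed = True
    return g

g = reduce_to_H([[0.0, 0.5], [0.25, 0.25]])
```

Starting from the given coupling with margins (0.5, 0.5) and (0.25, 0.75), a single basic move reaches the North-West table.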
In the following remark we show that the Euclidean distance on R is a typical case where the optimal coupling is not unique. We observe that, if the ground set is X = {1, 2, 3, 4} with the Euclidean distance d(i, j) = |i − j|, then all three couplings have the same c-cost, namely c(γ) = 1.5, which also equals the Kantorovich distance. Although this example is rather special, because it has one row and one column with zero probability, it allows us to exhibit several couplings sharing the same c-cost by means of small tables. Notice that the coupling γ_H is the coupling of maximum homophily, while the coupling γ_D has the highest possible concentration on the main diagonal.
Moreover, all mixtures of the three previous couplings again have c(γ) = 1.5, showing that the set of optimal couplings is a face of the polytope. This derives from the fact that, with d(i, j) = |i − j|, the basic moves involving one diagonal cell, namely those of the form M_{i₁,i₂,i₂,j₂} with i₁ < i₂ < j₂, have null Kantorovich value.
The following proposition highlights an interesting connection between the discrete and the continuous frameworks for the case of the Euclidean distance. In the discrete case the optimality of the H-table follows from previous results, and the optimality in the continuous case is derived.

Proposition 3.5. Given any pair of non-decreasing real sequences (x_i) and (y_i), i = 1, …, N, with sample marginal distributions μ_N and ν_N, respectively, the homophily coupling γ_H coincides with the distribution of (x_i, y_i), i = 1, …, N, and hence it minimizes the expected distance E(|X − Y|) among all couplings in P(μ_N, ν_N). In general, given any pair of discrete probability functions μ and ν, γ_H(μ, ν) is optimal for the Euclidean distance in R.
Proof. The first part follows directly from Proposition 3.2. The second part follows from the continuity of (μ, ν) → K c (μ, ν), see Proposition 2.2.
It is easily checked that the function γ_α = γ − αM belongs to P(μ, ν) and that its value is d(γ) − α(d(1, 2) + d(3, 1) − d(3, 2)), where d(1, 2) + d(3, 1) − d(3, 2) ≥ 0 is true by the triangle inequality. The equality must hold, otherwise the value would be strictly smaller than the K-distance. In conclusion, γ_α is an optimal coupling with γ_α(1, 1) > 0 and with all the other diagonal elements equal to those of the original γ.

The next proposition asserts that the support of an optimal coupling is generically a connected graph. A detailed study of how the support of an optimal coupling depends on the given distance has been made in [12].

Proposition 3.7.
If the support of the optimal coupling γ is a disconnected graph, with connected components (X_i, S_i), i = 1, …, k, then μ(X_i) = ν(X_i) for all i = 1, …, k and γ = Σ_{i=1}^k γ_i, where each γ_i is supported by X_i × X_i and is proportional to an optimal coupling for the conditional margins μ|X_i and ν|X_i.
Proof. Without restriction of generality, we consider the case k = 2. Assume the supporting graph of γ has components (X₁, S₁) and (X₂, S₂). This means that γ(x, y) = 0 unless x and y both belong to X₁ or both belong to X₂. It follows that μ(X₁) = Σ_{x₁,y₁∈X₁} γ(x₁, y₁) = ν(X₁) and, for the same reason, μ(X₂) = Σ_{x₂,y₂∈X₂} γ(x₂, y₂) = ν(X₂). Now, the K-distance takes a conditional form as a mixture over the components. Each of the conditioned couplings γ|X_i×X_i, i = 1, 2, is a coupling of the conditioned margins μ|X_i and ν|X_i, and such couplings are necessarily optimal.

Multidimensional extension
In this section we extend the results in Proposition 3.3 to the case of joint probability functions with three given margins.
is well defined, and it is unique. We call this joint probability function the joint probability function of maximal homophily. In Equation (4.1) the sign ≺ is to be read in lexicographic order, e.g., (j, k) ≺ (j′, k′) if and only if either j < j′, or j = j′ and k < k′.
Proof. We prove that the definition in Equation (4.1) is well posed.

We now introduce the basic moves in the tri-variate case and prove that they are enough to connect all joint probability functions, using the same arguments as in the bi-variate case. To ease the notation, we write only the indices and omit the symbol δ when writing the moves.
There are two types of basic moves: in the first type the +1 entries have a common index, while in the second type the +1 entries have all different indices. Two examples of basic moves are pictured in Figure 2.
We are now ready to extend Theorem 3.1 to the tri-variate case.
Proof. We prove that from each joint probability function we can reach the maximal homophily coupling by using basic moves, following the same strategy as in the proof of Theorem 3.4.
If the condition in Equation (4.1) is not satisfied, then there is an entry (i, j, k), together with entries lexicographically greater than it, at which mass can be rearranged. Then, define the integer move M with • −1 in (i, j, k), (i′, j′, k′) and (i″, j″, k″); • +1 in (i, j′, k′), (i′, j, k″) and (i″, j″, k).
Such a move, applied with the coefficient α above, enforces the condition in Equation (4.1) at the point (i, j, k). The new points (i′, j′, k′) and (i″, j″, k″) are lexicographically greater than (i, j, k), so that, scanning the joint probability function lexicographically from (1, 1, 1), the procedure ends in a finite number of steps.
Finally, note that if the move M lies in a slice (i.e., i = i′ = i″ or j = j′ = j″ or k = k′ = k″), then M is a basic move, since one +1 and one −1 coincide. In the other cases, the move M can be decomposed into two basic moves: • M₁ with −1 in (i, j, k) and (i′, j′, k′), +1 in (i, j′, k′) and (i′, j, k); • M₂ with −1 in (i′, j, k) and (i″, j″, k″), +1 in (i′, j, k″) and (i″, j″, k).
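The decomposition above can be verified numerically. In the following sketch (ours), the index triples are instantiated as (i, i′, i″) = (0, 1, 2), and similarly for j and k; one checks that the +1 of M₁ in (i′, j, k) cancels the −1 of M₂ there, and that the sum M has null univariate margins along all three axes:

```python
import numpy as np

n = 3
i, i1, i2 = 0, 1, 2   # play the roles of i, i', i''
j, j1, j2 = 0, 1, 2
k, k1, k2 = 0, 1, 2

M1 = np.zeros((n, n, n))
M1[i, j, k] -= 1;  M1[i1, j1, k1] -= 1
M1[i, j1, k1] += 1; M1[i1, j, k] += 1

M2 = np.zeros((n, n, n))
M2[i1, j, k] -= 1;  M2[i2, j2, k2] -= 1
M2[i1, j, k2] += 1; M2[i2, j2, k] += 1

M = M1 + M2   # the auxiliary cell (i', j, k) cancels
margins_null = all(np.allclose(M.sum(axis=ax), 0)
                   for ax in [(0, 1), (0, 2), (1, 2)])
```

The sum M has six nonzero entries, the −1 and +1 patterns of the second-type basic move described above.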

Algorithm
Simulated Annealing for continuous variables was introduced in [21] and then optimized in several ways for special applications. In its basic form, a Simulated Annealing algorithm seeks the minimum of a real function through a Markov chain whose stationary distribution is uniform on the set of the global minima. At each step, the Markov chain moves within a suitable set of neighbours, and the transition probability is selected in order to obtain the desired stationary distribution. For further details, see [11]. The basic moves introduced in the previous sections allow us to define the neighbours and to obtain a connected chain. Moreover, we exploit the special properties of the Kantorovich function and, through Proposition 3.6, perform one further optimization step.
The pseudo-code of the algorithm is given in Figure 3. To simplify the presentation, we write the algorithm in the case of two-dimensional joint probability functions, but it can be easily adapted to the three-dimensional case.
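The algorithm can be sketched as follows (an illustrative implementation of ours, not the authors' exact pseudo-code of Figure 3; the parameter names and the geometric cooling schedule are assumptions consistent with the simulation study below). Each proposal draws a random basic move, takes the largest feasible step, and accepts with the usual Metropolis probability:

```python
import math
import random

def anneal(cost, mu, nu, tau0=0.1, B=2000, rho=0.95, seed=1):
    """Simulated Annealing over P(mu, nu) driven by basic moves."""
    rng = random.Random(seed)
    n, m = len(mu), len(nu)
    # Start from the independence coupling, far from the vertices of the polytope.
    g = [[mu[i] * nu[j] for j in range(m)] for i in range(n)]
    best = [row[:] for row in g]
    def value(t):
        return sum(cost[i][j] * t[i][j] for i in range(n) for j in range(m))
    tau = tau0
    for _ in range(B):
        u, x = rng.sample(range(n), 2)
        v, y = rng.sample(range(m), 2)
        # Basic move M = (delta_u - delta_x) (tensor) (delta_v - delta_y):
        # g + alpha*M decreases g at (u, y) and (x, v).
        alpha = min(g[u][y], g[x][v])
        if alpha > 0:
            delta = alpha * (cost[u][v] + cost[x][y] - cost[u][y] - cost[x][v])
            if delta <= 0 or rng.random() < math.exp(-delta / tau):
                g[u][v] += alpha; g[x][y] += alpha
                g[u][y] -= alpha; g[x][v] -= alpha
                if value(g) < value(best):
                    best = [row[:] for row in g]
        tau *= rho
    return value(best), best

cost = [[abs(i - j) for j in range(3)] for i in range(3)]
val, gamma = anneal(cost, [0.5, 0.5, 0.0], [0.5, 0.0, 0.5])
```

On this small example the optimal value is 0.5, attained by keeping the mass at 0 in place and transporting the mass at 1 to 2.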
To choose the simulation parameters (i.e., the initial temperature τ₀ and the length B of the Markov chain), we have performed a preliminary simulation study. In the first part of the study, we have computed the acceptance probability of the first move of the MCMC as a function of the initial temperature τ₀. The results are displayed in Table 1. Each value is based on a sample of 10,000 pairs of marginal probability functions μ, ν. Each entry of μ, ν is drawn from the uniform distribution U[0, 1], and the two vectors are then normalized.
Remark 5.1. Our Simulated Annealing implementation has the independence coupling as its starting point, because it is a joint probability distribution far from the vertices of the polytope.
The initial temperature τ₀ can be chosen reasonably small. For instance, if we fix 0.95 as the acceptance probability of the first move, τ₀ decreases with n and ranges from 10^{−0.6} for n = 4 to 10^{−2.0} for n = 20.
In the second part of the simulation study, we have inspected when the Markov chain stops producing new moves, in order to evaluate the convergence of the algorithm. For values of the number B of MCMC steps ranging from 10 to 10⁵, we have computed how many moves would be accepted in a window of 100 further steps. The simulation is based on 1,000 pairs of marginal probability functions μ, ν in each case, randomly chosen as in the first part of the study. The initial temperature for each n has been chosen from the first part of the study, as outlined above. The temperature decrease function used here is τ_b = τ₀(0.95)^b, b = 1, …, B, but similar results are obtained for other choices, namely τ_b = τ₀(0.99)^b, τ_b = τ₀/b, and τ_b = τ₀/log(1 + b).
The proportions of accepted moves are displayed in Table 2. We observe that, for values of the number B of MCMC steps ranging from 10³ to 10⁵, the acceptance probability of a new move is less than 0.001.

[Table 3: Optimal coupling of Example 5.1 found by the Simulated Annealing.]