Bayesian Topological Learning for Classifying the Structure of Biological Networks

Actin cytoskeleton networks generate local topological signatures due to the natural variations in the number, size, and shape of holes of the networks. Persistent homology is a method that explores these topological properties of data and summarizes them as persistence diagrams. In this work, we analyze and classify these filament networks by transforming them into persistence diagrams whose variability is quantified via a Bayesian framework on the space of persistence diagrams. The proposed generalized Bayesian framework adopts an independent and identically distributed cluster point process characterization of persistence diagrams and relies on a substitution likelihood argument. This framework provides the flexibility to estimate the posterior cardinality distribution of points in a persistence diagram and the posterior spatial distribution simultaneously. We present a closed form of the posteriors under the assumption of Gaussian mixtures and binomials for prior intensity and cardinality respectively. Using this posterior calculation, we implement a Bayes factor algorithm to classify the actin filament networks and benchmark it against several state-of-the-art classification methods.


Introduction
The active transport of various particles through intracellular movements is a vital process for cells of living organisms (Porter and Day (2016)). Such transportation must be intricately organized due to the tightly packed nature of the interior of a cell at the molecular level (Breuer et al. (2017)). The actin cytoskeleton, which consists of actin filaments cross-linking with myosin motor proteins along with other pertinent binding proteins, is an important component in plant cells that determines the structure of the cell and provides transport of cellular components (Freedman et al. (2017); Breuer et al. (2017)). Although researchers have investigated the molecular features of actin cytoskeletons (e.g., Staiger et al. (2000); Shimmen and Yokota (2004); Freedman et al. (2017); Mlynarczyk and Abel (2019)), the underlying process that determines their structures and how these structures are linked to intracellular transport remains undetermined (Thomas et al. (2009); Madison and Nebenführ (2013)). A crucial step toward understanding this transport is to define quantitative measures of the actin cytoskeleton's structure and to understand the different structural networks of filaments on which the organelles move. However, no existing method fully depicts the characteristics of these networks.
Researchers are interested in developing and using quantitative tools to capture actin filament structures. Higaki et al. (2010) analyze the skewness of the pixel intensities of microscopic images to quantify the actin cytoskeleton bundles and density. Banerjee and Park (2015) study and classify polymer network-based models to identify how actin cytoskeleton structure can be efficiently represented. On closer inspection, however, the inherent variation in size, density, and positioning of actin filaments yields topological signatures in the cytoskeleton's network (Tang et al. (2014)). In this article, we develop a fully data-driven Bayesian topological learning method, which could aid researchers by providing a pathway to predict cytoskeleton structural properties: we classify simulated actin filament networks to identify the effect of the number of cross-linking proteins on the network. With more cross-linking proteins available, the cell has networks with many binding locations, which create larger loops within the whole structure of the actin cytoskeleton. Topological data analysis (TDA) can be viewed as a dimensionality reduction method that allows us to map data from a high dimensional space to a lower dimensional space. When viewed through the lens of topology, these networks show dissimilarity due to the presence and size of loops. Differentiating between the empty space and the connectedness of these networks allows us to create an accurate classification rule using topological methods. Although we focus our analysis on the classification of actin filament networks, the topological Bayesian framework could be generalized to other data sets.
While there are several methods in the literature to compute persistence diagrams (PDs), we choose geometric complexes, which are typically used for applications of persistent homology to data analysis; see Edelsbrunner and Harer (2010) and references therein. The homological features in PDs have no intrinsic order, implying that they are sets as opposed to vectors. Due to this, the utilization of PDs in machine learning algorithms is not straightforward. Some researchers map the PDs into Hilbert spaces to adopt traditional machine learning tools (see e.g., Di Fabio and Ferri (2015); Turner et al. (2014); Adams et al. (2017); Bubenik (2015); Reininghaus et al. (2015)). Direct use of PDs for statistical inference and classification has been developed by several authors, such as Maroulas et al. (2019, 2020); Marchese and Maroulas (2018); Bobrowski et al. (2017); Fasy et al. (2014); Mileyko et al. (2011); Robinson and Turner (2017).
In this paper, we quantify the variability of PDs through a novel Bayesian framework by considering PDs as a collection of points distributed on a pertinent domain space, where the distribution of the number of points is also an important feature. This setting leads us to view a PD through the lens of an independent and identically distributed (i.i.d.) cluster point process (PP) (Daley and Vere-Jones (1988)). An i.i.d. cluster PP consists of points that are i.i.d. according to a probability density but have an arbitrary cardinality distribution. For example, an i.i.d. cluster PP reduces to the classical Poisson PP if the number of points in a PD follows a Poisson distribution. The study in Maroulas et al. (2020) implicitly estimates the cardinality of a PD by integrating the intensity of a Poisson PP. Because the Poisson cardinality in the framework of Maroulas et al. (2020) has variance equal to its mean, the cardinality estimate has high variance whenever the number of points in a PD is high. In contrast, modeling PDs as i.i.d. cluster PPs allows us to estimate the intensity and the cardinality component of the distribution simultaneously. This is critical, as the importance of cardinality in PDs has been underlined in problems related to statistics and machine learning (Fasy et al. (2014); Kerber et al. (2017)).
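The variance argument above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: it compares a Poisson cardinality with mean 4 against a binomial cardinality with the same mean (hypothetical parameters N0 = 8 and p = 0.5), the binomial being one example of the arbitrary cardinality distributions that an i.i.d. cluster PP admits. The function names are hypothetical and not taken from any package.

```python
from math import comb, exp, factorial

def poisson_pmf(mean, n_max):
    """Poisson pmf on 0..n_max (the tail beyond n_max is negligible here)."""
    return [exp(-mean) * mean**n / factorial(n) for n in range(n_max + 1)]

def binomial_pmf(n_trials, p):
    """Binomial pmf on 0..n_trials."""
    return [comb(n_trials, n) * p**n * (1 - p) ** (n_trials - n)
            for n in range(n_trials + 1)]

def mean_and_variance(pmf):
    """Mean and variance of a pmf supported on 0, 1, ..., len(pmf) - 1."""
    mean = sum(n * w for n, w in enumerate(pmf))
    var = sum((n - mean) ** 2 * w for n, w in enumerate(pmf))
    return mean, var

# Both cardinality models have mean 4 (assumed illustrative value).
poisson_mean, poisson_var = mean_and_variance(poisson_pmf(4.0, 80))
binomial_mean, binomial_var = mean_and_variance(binomial_pmf(8, 0.5))

print("Poisson  mean/variance:", poisson_mean, poisson_var)
print("Binomial mean/variance:", binomial_mean, binomial_var)
```

With matched means, the Poisson variance equals the mean, whereas the binomial variance is the mean multiplied by (1 - p), so it is strictly smaller whenever p > 0.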
Our Bayesian framework quantifies prior uncertainty with given intensity and cardinality for an i.i.d. cluster PP. The likelihood in our model represents the level of belief that observed diagrams are representative of the entire population and are defined through marked point processes (MPPs). A central idea of this paper is to develop posterior distributions of the spatial configuration of points on persistence diagrams and their associated number instead of the point clouds in the data generating space. The persistence diagrams summarize their topology which in turn is employed in the classification algorithm. By viewing point clouds through their topological descriptors, the proposed framework can reveal essential shape peculiarities latent in the point clouds. Our Bayesian method adopts a substitution likelihood technique by Jeffreys in Jeffreys (1961) instead of considering the full likelihood for the point cloud. Due to the nature of PDs, an observed PD contains points that correspond to the latent topology in the underlying data as well as points that solely arise due to noise in the data. Our Bayesian model addresses instances of noise by means of an i.i.d. cluster PP. In particular, we are able to quantify the uncertainty with an estimated intensity and cardinality using the i.i.d. cluster PP. This framework estimates the posterior cardinality and intensity simultaneously, which provides a complete knowledge of the posterior distribution.
Another key contribution of this paper is the derivation of a closed form of the posterior intensity, which relies on Gaussian mixture densities for prior intensities and a closed form for the posterior cardinality, which uses binomial priors. The direct benefits of this closed form solution of the posterior distribution are two-fold: (i) it demonstrates the computational tractability of the proposed Bayesian model and (ii) it provides a means to develop a robust classification scheme through Bayes' factors. Another computational benefit of these closed forms is the quantification of the intensity of the unexpected PP by means of an exponential density. The exponential density is an ideal choice because (i) it gives a natural intuition of the unexpected (noise) features, and (ii) it provides a more computationally automatic approach as we only need to modify one parameter. This Bayesian paradigm provides a method for the classification of actin filament networks in plant cells that captures their distinguishing topological features.
Overall, the contributions of this work are: 1. A generalized Bayesian framework that simultaneously estimates the spatial and the cardinality distribution of PDs using i.i.d. cluster PPs.
This paper is organized as follows. Section 2 provides a brief overview of PDs and PPs. In Section 3, we establish the Bayesian framework for PDs and provide the update formulas for intensity and cardinality. Then Subsection 3.1 introduces a closed form representation of the posterior intensity and cardinality utilizing Gaussian mixture models and binomial distributions respectively. Detailed demonstrations of this closed form estimation are presented in Subsection 3.2. To assess the capability of our Bayesian method, we investigate a problem of classifying filament networks of plant cells in Section 4. Finally, we end with a discussion in Section 5. We defer all of the proofs, as well as some definitions, lemmas, and notations required for the proofs, to the supplementary materials (Maroulas et al., 2021).

Preliminaries
We begin by discussing the necessary background for generating Bayesian models for PDs. In Subsection 2.1, we briefly review simplicial complexes, the building blocks for constructing PDs. Pertinent definitions, theorems, and some basic facts about i.i.d. cluster point processes (PPs) are discussed in Subsection 2.2.

Persistence Diagrams
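Definition 2.1. The convex hull of a finite set of points $\{x_i\}_{i=1}^{n}$ is given by $\mathrm{conv}(\{x_i\}_{i=1}^{n}) := \left\{ \sum_{i=1}^{n} \alpha_i x_i : \alpha_i \geq 0, \ \sum_{i=1}^{n} \alpha_i = 1 \right\}$.
A $k$-simplex $[x_0, \dots, x_k]$ is a collection of $k + 1$ affinely independent elements along with their convex hull. The faces of a $k$-simplex are the $(k - 1)$-simplices spanned by subsets of $\{x_0, \dots, x_k\}$.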
Definition 2.4. A simplicial complex σ is a collection of simplices such that for every set A in σ and every nonempty set B ⊂ A, we have that B is in σ.
Definition 2.5. The Vietoris-Rips complex for threshold $\epsilon > 0$, denoted $VR(\epsilon)$, is the abstract simplicial complex determined in the following way: a $k$-simplex with vertices given by $k + 1$ points in $X$ is included in $VR(\epsilon)$ whenever the $\epsilon/2$-balls placed at the points all have pairwise intersections.
Formally, for each homological dimension, a PD is a multiset of points $(b, d)$, where $b$ is the radius in the Vietoris-Rips complex at which a homological feature is born and $d$ is the radius at which it dies. Intuitively, the homological features represented in a PD are connected components or holes of different dimensions. To illustrate the Vietoris-Rips complexes we present a toy example in Figure 1 by considering an "eight" shape in (a). The algorithm starts by taking into account circles with increasing radii (at each algorithmic step) centered at each data point. As the "resolution" of the data changes by increasing the radii, homological features emerge or disappear by examining if two or more circles intersect. For example, Figure 1(b) presents a scenario where only two circles intersect. While the circles centered at each data point grow and more connected components are generated, holes and voids may also be created (see Figure 1 (c)). Eventually, they get filled due to increasing the radii, and the process ends when all circles intersect. The algorithmic results are summarized in a persistence diagram. Each point in a persistence diagram (shown as red triangles in Figure 1

I.I.D. Cluster Point Processes
This section contains basic definitions and fundamental theorems related to i.i.d. cluster PPs. Detailed treatments of i.i.d. cluster PPs can be found in Daley and Vere-Jones (1988) and references therein. Throughout this section, we let $\mathbb{X}$ be a Polish space and $\mathcal{X}$ be its Borel σ-algebra.
Definition 2.6. A finite point process $(\{\rho_n\}, \{P_n(\cdot)\})$ consists of a cardinality distribution $\rho_n$ with $\sum_{n=0}^{\infty} \rho_n = 1$ and a symmetric probability measure $P_n$ on $\mathcal{X}^n$, where $\mathcal{X}^0$ is the trivial σ-algebra.
To sample from a PP, first one draws an integer n from the cardinality distribution ρ n . Then the n points (x 1 , . . . , x n ) are spatially distributed according to a draw from P n . Since PPs model unordered collections of points, we need to ensure that P n assigns equal weights to all n ! permutations of (x 1 , . . . , x n ). The requirement in Definition 2.6 that P n is symmetric guarantees this. A natural way to work with random collections of points is the Janossy measure, which combines the cardinality and spatial distributions, while disregarding the order of the points.
Definition 2.7. For disjoint rectangles $A_1, \dots, A_n$, the Janossy measure for a finite point process is given by $J_n(A_1 \times \cdots \times A_n) = n! \, \rho_n \, P_n(A_1 \times \cdots \times A_n)$.
Definition 2.8. An i.i.d. cluster PP $\Psi$ is a finite PP on the space $(\mathbb{X}, \mathcal{X})$ which has points that: (i) are located in $\mathbb{X} = \mathbb{R}^d$, (ii) have a cardinality distribution $\rho_n$ with $\sum_{n=0}^{\infty} \rho_n = 1$, and (iii) are distributed according to some common probability measure $F(\cdot)$ on the Borel σ-algebra $\mathcal{X}$.
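The two-step sampling procedure for a finite PP (draw the cardinality, then draw the points) can be sketched as follows for an i.i.d. cluster PP. This is a minimal illustration rather than the authors' implementation: the binomial cardinality, the isotropic Gaussian spatial distribution, and all parameter values and function names are assumptions chosen for the example.

```python
import random
from math import comb

def binomial_pmf(n_max, p):
    """Cardinality distribution rho_n on 0..n_max (assumed binomial here)."""
    return [comb(n_max, n) * p**n * (1 - p) ** (n_max - n)
            for n in range(n_max + 1)]

def draw_gaussian_point(rng, mean=(1.0, 2.0), sd=0.1):
    """One draw from the common spatial distribution F (assumed Gaussian)."""
    return (rng.gauss(mean[0], sd), rng.gauss(mean[1], sd))

def sample_iid_cluster_pp(cardinality_pmf, draw_point, rng):
    """Draw one realization: first the number of points, then the points."""
    # Step 1: draw n from the cardinality distribution rho_n.
    support = range(len(cardinality_pmf))
    n = rng.choices(support, weights=cardinality_pmf, k=1)[0]
    # Step 2: draw the n points i.i.d. from F; their order carries no meaning.
    return [draw_point(rng) for _ in range(n)]

rng = random.Random(0)
rho = binomial_pmf(n_max=6, p=0.5)
sample = sample_iid_cluster_pp(rho, draw_gaussian_point, rng)
print(len(sample), sample)
```

Because every point is drawn from the same distribution, the joint law of the returned list is automatically symmetric under permutations, which is the requirement placed on $P_n$ in Definition 2.6.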
We consider Janossy measures for the point process $\Psi$, $J^{\Psi}_n$, that admit densities $j_n$ with respect to a reference measure on $\mathbb{X}$ due to their intuitive interpretation. In particular, for an i.i.d. cluster PP, $j_n(x_1, \dots, x_n) = \rho_n \, n! \prod_{i=1}^{n} f(x_i)$, where $f$ is the probability density of $F$ and the product $\prod_{i=1}^{n} f(x_i)$ determines the probability density of finding the $n$ points at their respective locations according to $F$. The $n!$ term gives the number of ways the points could be at these positions. For a finite intensity measure $\Lambda$ on $\mathbb{X}$ that admits the density $\lambda$, we also have $f(x) = \lambda(x) / \Lambda(\mathbb{X})$. The intensity is the point process analog of the first order moment of a random variable. Precisely, the intensity density $\lambda(x)$ is the density of the expected number of points per unit volume at $x$. Hereafter, we characterize our i.i.d. cluster PPs by their intensity and cardinality measures.
Next, we define the marked PP, which provides a formulation for the likelihood model used in our Bayesian setting. Let $\mathbb{M}$ be a Polish space that represents the mark space, and let its Borel σ-algebra be $\mathcal{M}$.
Definition 2.9. Suppose $\ell : \mathbb{X} \times \mathcal{M} \to \mathbb{R}^{+} \cup \{0\}$ is a function satisfying: 1) for all $x \in \mathbb{X}$, $\ell(x, \cdot)$ is a probability measure on $\mathbb{M}$, and 2) for all $B \in \mathcal{M}$, $\ell(\cdot, B)$ is a measurable function on $\mathbb{X}$. Then, $\ell$ is a stochastic kernel from $\mathbb{X}$ to $\mathbb{M}$.
Remark 1. A marked point process $(\Psi, \Psi_M)$ is a bivariate PP where one point process is parameterized by the other. Therefore, if the cardinalities of $\mathbf{x}$ and $\mathbf{m}$ are equal, then the conditional density for $\mathbf{m}$ is $\ell(\mathbf{m}|\mathbf{x}) = \frac{1}{n!} \sum_{\pi \in S_n} \prod_{i=1}^{n} \ell(m_i | x_{\pi(i)})$, where $S_n$ is the set of all permutations of $(1, \dots, n)$. Otherwise, the density can be taken as 0.
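The permutation-averaged conditional density in Remark 1 can be evaluated directly for small configurations. The sketch below is an illustration only: it assumes an isotropic Gaussian stochastic kernel with a hypothetical standard deviation, and it enumerates all n! permutations, which is practical only for a handful of points.

```python
from itertools import permutations
from math import exp, factorial, pi

def gaussian_kernel(m, x, sd=0.1):
    """Assumed kernel: isotropic 2-D Gaussian density of mark m centred at x."""
    d2 = (m[0] - x[0]) ** 2 + (m[1] - x[1]) ** 2
    return exp(-d2 / (2 * sd**2)) / (2 * pi * sd**2)

def marked_density(marks, points, kernel=gaussian_kernel):
    """Average over all assignments of marks to points; 0 if sizes differ."""
    n = len(points)
    if len(marks) != n:
        return 0.0
    total = 0.0
    for perm in permutations(range(n)):
        prod = 1.0
        for i, j in enumerate(perm):
            prod *= kernel(marks[i], points[j])  # mark i paired with point perm[i]
        total += prod
    return total / factorial(n)

points = [(0.2, 0.8), (0.5, 0.4)]
marks = [(0.52, 0.41), (0.19, 0.83)]
print(marked_density(marks, points))
print(marked_density(marks, points[::-1]))  # same value: order is irrelevant
print(marked_density(marks[:1], points))    # 0.0: cardinalities differ
```

Averaging over permutations is what makes the density invariant to the order in which either collection is listed, consistent with treating PDs as unordered sets.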
The final two definitions we need to construct the Bayesian theorem are the probability generating functional (PGFL) and elementary symmetric function. The PGFL is a point process analogue of the probability generating function (PGF) of random variables. Intuitively, the point process can be characterized by the functional derivatives of the PGFL (Moyal (1962)). The other necessary definitions and theorems related to the PGFL, which will be heavily used in the proof of Theorem 3.1 are provided in Section 1 of the supplementary materials.
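Definition 2.11. Let $\Psi$ be a finite PP on $\mathbb{X}$ and $\mathcal{H}$ be the Banach space of all bounded measurable complex valued functions $\zeta$ on $\mathbb{X}$. For a symmetric function $\zeta(\mathbf{x}) = \zeta(x_1) \cdots \zeta(x_n)$, the PGFL of $\Psi$ is given by
$G[\zeta] = E\left[ \prod_{i=1}^{N} \zeta(x_i) \right] = \rho_0 + \sum_{n=1}^{\infty} \frac{1}{n!} \int_{\mathbb{X}^n} \left( \prod_{j=1}^{n} \zeta(x_j) \right) J_n(dx_1 \cdots dx_n)$. (2.1)
The first expression shows the analogy of the PGFL with the PGF, as it is the expectation of the product $\prod_{i=1}^{N} \zeta(x_i)$.
Remark 2. For an i.i.d. cluster process $\Psi$ the PGFL has the form (Daley and Vere-Jones (1988)): $G[\zeta] = g_N\left( \int_{\mathbb{X}} \zeta(x) f(x) \, dx \right)$, where $g_N$ is the PGF of the cardinality $N$, $\zeta$ has the same form as in Definition 2.11, and $f$ is the probability density discussed after Definition 2.8.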
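As a short worked illustration of the i.i.d. cluster form of the PGFL, $G[\zeta] = g_N\left(\int \zeta f\right)$, the following derivation plugs in two standard cardinality PGFs. The binomial parameters $(N_0, \rho_x)$ and the Poisson mean $\mu$ are used here only as generic symbols for the example.

```latex
% Shorthand: f[\zeta] := \int_{\mathbb{X}} \zeta(x) f(x)\,dx.
%
% Binomial cardinality N ~ Bin(N_0, \rho_x):
%   g_N(z) = (1 - \rho_x + \rho_x z)^{N_0}
\begin{align*}
G[\zeta] &= \bigl(1 - \rho_x + \rho_x\, f[\zeta]\bigr)^{N_0}.
\end{align*}
%
% Poisson cardinality N ~ Poisson(\mu):
%   g_N(z) = e^{\mu (z - 1)}
\begin{align*}
G[\zeta] &= \exp\bigl(\mu\,(f[\zeta] - 1)\bigr)
          = \exp\Bigl(\int_{\mathbb{X}} (\zeta(x) - 1)\,\lambda(x)\,dx\Bigr),
\qquad \lambda(x) = \mu f(x).
\end{align*}
% The second line uses \int f = 1 and recovers the PGFL of a Poisson PP
% with intensity \lambda.
```

The Poisson case shows concretely how the classical Poisson PP arises as the special case of an i.i.d. cluster PP with Poisson cardinality.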

Bayesian Inference
For developing the framework for Bayesian inference, we consider the underlying prior uncertainty of a PD $D_X$ generated by an i.i.d. cluster PP $\mathcal{D}_X$ with intensity $\lambda_{\mathcal{D}_X}$ and cardinality distribution $\rho_{\mathcal{D}_X}(n) = P(|D_X| = n)$, the probability that the number of elements in the PP $\mathcal{D}_X$ equals $n$, where $|\cdot|$ denotes the cardinality. To motivate the discussion and develop the intuition about our method, consider point clouds (blue dots in Figure 2 (a) and (c)) generated from an 'eight' shape (solid line in Figure 2 (a) and (c)). A perfect noiseless point cloud yields a 1-dimensional PD with two points whose y-coordinates (persistence values) are much greater than 0. However, noisy point clouds lead to PDs which may or may not clearly expose the topological 'fingerprint' of prominent points, depending on the level of noise. For example, observe in Figure 2 (b) that the two higher persistence points are well separated from the noise points close to the birth axis. That is not the case in Figure 2 (d), where only one prominent point is detected. To that end, a sample persistence diagram from a prior that carries the knowledge about topological signatures of the underlying truth would be only partially observed. Hence a point $x$ in $D_X$ is observed with probability $\alpha(x)$ or vanishes with probability $1 - \alpha(x)$ as a result of noise in the data. Consequently, the prior $\mathcal{D}_X$ is decomposed as $\mathcal{D}_X = \mathcal{D}_{X_O} \cup \mathcal{D}_{X_V}$, where $\mathcal{D}_{X_O}$ and $\mathcal{D}_{X_V}$ collect the observed and vanished features respectively.
Employing the theory of marked PPs, we establish the likelihood model by considering the association with the observed PDs $D_Y$, samples of an i.i.d. cluster PP $\mathcal{D}_Y$. Its component $\mathcal{D}_{Y_O}$ consists of the points in the data (persistence diagram) that are linked to $\mathcal{D}_{X_O}$ via a pertinent marked PP with stochastic kernel $\ell(y|x)$ (as in Definition 2.9). Consequently, for the marked PP $(\mathcal{D}_{X_O}, \mathcal{D}_{Y_O})$, the intensity (spatial) likelihood is computed using the stochastic kernel $\ell(y|x)$. The cardinality likelihood is obtained by the conditional distribution of the observed PD $D_Y$ given that there are $n$ points in $D_X$.
Typically, PDs $D_Y$ consist of points that correspond to the latent topology of the point cloud and points that arise solely from noise. The points that arise from noise fail to associate with the prior, and we call them unexpected features. We model such points as generated by an i.i.d. cluster PP $\mathcal{D}_{Y_U}$ with intensity density $\lambda_{\mathcal{D}_{Y_U}}$ and cardinality probability distribution $\rho_{\mathcal{D}_{Y_U}}$. Figure 3 illustrates the contribution of the prior and the observations to the spatial and the cardinality distributions of PDs in the Bayesian framework. For this, we superimpose two PDs: one is a sample from the prior ($D_X$, shown as triangles) and the other is the observed PD ($D_Y$, shown as dots) (see Figure 3 (a)). More precisely, the PD of a synthetic point cloud is considered as $D_X$ and the PD of a perturbed version of the point cloud is considered as $D_Y$. Any arbitrary point $x$ in $D_X$ is equipped with a probability of being observed; in Figure 3 we show observed points ($D_{X_O}$) in blue and vanished points ($D_{X_V}$) in brown. Any point $x \in D_{X_O}$ is marked with an observed point in $D_{Y_O}$ via the marked PPs. This implies that for any possible configuration, the number of points in $D_{X_O}$ is the same as in $D_{Y_O}$, as shown in the blue box in Figure 3 (c). We also present the unexpected features $D_{Y_U}$ (associated with noise) in red in Figure 3. We show different possible scenarios for the relationship of the prior to the data likelihood for the cardinality distribution of PDs in Figure 3 (b). As any point in $D_{X_V}$ has no association with points in the observed PD $D_Y$, if all of the points in $D_X$ belong to $D_{X_V}$, then in the observed PD we encounter only the unexpected features $D_{Y_U}$ (the first bar in Figure 3 (b)). This scenario is possible in the presence of very high noise in the data. As some points of $D_Y$ are more likely to be marks than others, we illustrate these instances with different levels for the blue parts of the cardinality bars.
The last two bars demonstrate cases where all of the points in D Y are expected to be marks of the prior features; this is encountered in the presence of very low noise in data. The stochastic matching is used as likelihood basically to understand the nature of points on the observed PDs. The posterior intensity and cardinality are given in the theorem below, whose proof is delegated to Section 1.1 in the supplementary materials.
Theorem 3.1. For a random PD, denote the prior intensity and cardinality by λ D X and ρ D X , respectively. Suppose α(x) is the probability of observing a prior feature, and D X O and D X V are two instances of observed and vanished features in the prior respectively. If (y|x) is the stochastic kernel that links D Y O with D X O , and λ D Y U and ρ D Y U are the intensity and cardinality of D Y U respectively, then for a set of independent samples of PDs D Y1:m = {D Y1 , · · · , D Ym } from D Y with cardinalities K 1 , · · · , K m , we have the following posterior intensity and cardinality: dx is a linear functional, P n i is the permutation coefficient, and the sum in Γ 0,0 In the posterior intensity expression given in (3.1), the two terms reflect the decomposition of the prior intensity. Due to the arbitrary cardinality distribution assumption for i.i.d. cluster point processes, the two terms are also weighted by two factors B(∅) and B(y) respectively. The first term is for the vanished features D X V , where the intensity is weighted by 1 − α(x) and B(∅). The factor B(∅) is encountered since there is no y ∈ D Yi to represent the vanished features D X V . The second term in (3.1) corresponds to the observed part D X O and is weighted by α(x) and B(y). The factor B(y) depends on specific y ∈ D Yi to account for the associations between the features in D X O and those in D Yi . To be more precise, if x ∈ D X is observed, it can be associated with any of the y ∈ D Yi and the remaining points of D Yi , defined as D Yi \ y, are considered to either be observed from the rest of the features in D X or originated as unexpected features D Y U .
The posterior cardinality is given in (3.2). The associated likelihood is given as the sum from k = 0 to K i , where K i is the number of features in D Yi . This provides the likelihood of each observed PD D Yi given that there are n points in D X . In particular, for k = 0, the cardinality term for the unexpected feature reduces to ρ D Y U (K i ) and the intensity term for the vanished feature reduces to (λ D X [1 − α]) n . This implies that if the observed PD consists only of unexpected features then all of the points in the prior are most likely to have vanished. As the value of k increases, contributions from the unexpected features and vanished features decrease, indicating the presence of more associations between prior and observed features through the marked point process

Closed Form of Posterior Estimation
Next, we present a closed form solution to the posterior intensity and cardinality equation of Theorem 3.1 by considering a Gaussian mixture density for the prior intensity and a binomial distribution for the prior cardinality. Below we specify the necessary components of Theorem 3.1 to derive these closed forms.

(M1)
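The expressions for the prior intensity $\lambda_{\mathcal{D}_X}$ and cardinality $\rho_{\mathcal{D}_X}$ are: $\lambda_{\mathcal{D}_X}(x) = \sum_{i=1}^{N} c_i^{\mathcal{D}_X} \mathcal{N}^*(x; \mu_i^{\mathcal{D}_X}, \sigma_i^{\mathcal{D}_X} I)$ and $\rho_{\mathcal{D}_X}(n) = \binom{N_0}{n} \rho_x^{n} (1 - \rho_x)^{N_0 - n}$, where $N$ is the number of components of the Gaussian mixture, $c_i^{\mathcal{D}_X}$ is the weight of the $i$-th component, $\mu_i^{\mathcal{D}_X}$ is the mean, which is a $2 \times 1$ vector of birth and persistence coordinates, and $\sigma_i^{\mathcal{D}_X} I$ is the $2 \times 2$ covariance matrix of the $i$-th component. Since PDs are modeled as point processes on the space $\mathbb{W}$ and not on $\mathbb{R}^2$, the Gaussian densities are restricted to $\mathbb{W}$ as $\mathcal{N}^*(z; \upsilon, \sigma I) := \mathcal{N}(z; \upsilon, \sigma I) 1_{\mathbb{W}}(z)$, with mean $\upsilon$ and covariance matrix $\sigma I$, where $1_{\mathbb{W}}$ is the indicator function of $\mathbb{W}$. $N_0 \in \mathbb{N}$ is the maximum number of points in the prior PP and $\rho_x \in [0, 1]$ is the probability of one point to fall in the space $\mathbb{W}$.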
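The ingredients of (M1) can be sketched in a few lines: a Gaussian mixture intensity restricted to the wedge and a binomial cardinality. This is a minimal sketch under stated assumptions, not the BayesTDA implementation: the wedge is assumed to be the set of points with nonnegative birth and positive persistence, and the component weights, means, and variances below are hypothetical values.

```python
from math import comb, exp, pi

def in_wedge(z):
    """Assumed wedge W in (birth, persistence) coordinates: b >= 0, p > 0."""
    b, p = z
    return b >= 0.0 and p > 0.0

def restricted_gaussian(z, mean, var):
    """N*(z; mean, var * I): isotropic 2-D Gaussian density, zero outside W."""
    if not in_wedge(z):
        return 0.0
    d2 = (z[0] - mean[0]) ** 2 + (z[1] - mean[1]) ** 2
    return exp(-d2 / (2 * var)) / (2 * pi * var)

def prior_intensity(z, weights, means, variances):
    """Gaussian mixture intensity: sum_i c_i * N*(z; mu_i, sigma_i * I)."""
    return sum(c * restricted_gaussian(z, m, v)
               for c, m, v in zip(weights, means, variances))

def binomial_cardinality(n, n0, rho):
    """Probability of n points when at most n0 points each occur w.p. rho."""
    if n < 0 or n > n0:
        return 0.0
    return comb(n0, n) * rho**n * (1 - rho) ** (n0 - n)

# Hypothetical two-component prior with weight 2 on each component.
weights, variances = [2.0, 2.0], [0.01, 0.01]
means = [(0.2, 0.8), (0.5, 0.4)]
print(prior_intensity((0.2, 0.8), weights, means, variances))
print(prior_intensity((0.2, -0.1), weights, means, variances))  # outside W
print([round(binomial_cardinality(n, 8, 0.5), 4) for n in range(9)])
```

Note that the mixture weights are not required to sum to one: the intensity integrates to the expected number of points rather than to a probability.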
$\mu^{\mathcal{D}_{Y_U}}$ controls the rate of decay away from the origin. This distribution for $\lambda_{\mathcal{D}_{Y_U}}$ considers points closer to the origin more likely to be unexpected features. Points close to the origin in PDs are often created either from the spacing between the point clouds due to sampling or from the presence of noise in the data. Typically points with higher persistence or higher birth represent significant topological signatures, so for our analysis we count them as less likely to be unexpected. The cardinality distribution is $\rho_{\mathcal{D}_{Y_U}}(n) = \binom{M_0}{n} \rho_y^{n} (1 - \rho_y)^{M_0 - n}$, where $M_0 \in \mathbb{N}$ is the maximum number of points in the PP $\mathcal{D}_{Y_U}$ and $\rho_y \in [0, 1]$ is the probability of one point to fall in the space $\mathbb{W}$.
Proposition 3.1. Suppose that λ D X , ρ D X , (y|x), λ D Y U , and ρ D Y U satisfy the assumptions (M1)-(M3), and α is fixed. Then the posterior intensity and cardinality of Theorem 3.1 are given by: We present the proof in Section 2 of the supplementary materials. One can see that the intensity estimation in (3.8) is in the form of a Gaussian mixture, and hence it is obtained from a conjugate family of priors. However, we do not observe a similar property for the cardinality estimation. A detailed example of these estimations is provided in Section 3.2. The cardinality distribution in (3.9) is computed for infinitely many values of n, which is unattainable. Hence, for the practical application, we must truncate n at some N max such that N max is sufficiently larger than the number of points in the prior PP. Without loss of generality, we can choose N max = N 0 .

Sensitivity Analysis
We present the following example to (i) illustrate the estimation of the posterior using (3.8) and (3.9), (ii) examine the effects of the choice of prior intensity and cardinality on the posterior distributions, and (iii) examine the effects of likelihood function and unexpected features parameters on the posterior distributions. To reproduce these results, the interested reader may download our R-package BayesTDA. We consider point clouds generated from a polar curve that contains two inner loops (see Figure 4 (a)) and focus on 1-dimensional features in their corresponding PDs as they are the important homological features of this shape.

Choice of Priors
We commence by defining an i.i.d cluster PP with three types of prior intensities: (i) informative, (ii) weakly informative, and (iii) uninformative, and two

[Table 1: Parameters for (M1) used to define the prior PP.]
We present all intensity maps on a scale from 0 to 1 throughout this example to ensure uniformity. Due to the symmetric nature of the polar curve, in a noiseless scenario, the corresponding PD consists of two points each with multiplicity two. Hence we use two Gaussian components centered at the two points with a very small variance and weights $c_i^{\mathcal{D}_X} = 2$ for the informative intensity (II) (see Figures 4, 6-7 (b)). The weakly informative intensity (WII) also has two Gaussian components centered at the same points as II with a slightly higher variance (see Figures 4, 6-7 (c)). The informative cardinality (IC) is determined by using a discrete distribution with the highest probability at cardinality 4 (see Figures 4, 6-7 (e)). On the other hand, for the uninformative intensity (UI), we use one Gaussian component centered at an arbitrary point with higher variance than the informative cases, as shown in Figures 4, 6-7 (d). Similarly, the uninformative cardinality (UC) follows a discrete uniform distribution (see Figures 4, 6-7 (i)). We present the list of parameters used to define the prior PP in Table 1. We examine the cases below.
The observed PDs are generated from point clouds sampled uniformly from the polar curve and perturbed by varying levels of Gaussian noise with variances $0.001 I_2$ (Figure 4 (a)), $0.005 I_2$ (Figure 6 (a)), and $0.01 I_2$ (Figure 7 (a)), which are considered in Case-1, Case-2, and Case-3 respectively. Consequently, their PDs exhibit distinctive characteristics such as four prominent features with high persistence and very few spurious features, four prominent features with medium persistence and several spurious features, and three prominent features with medium persistence and many spurious features.
Furthermore, for this case we present a comparison between the cardinality statistics given by the i.i.d. cluster point process characterization of the PD presented herein and the Poisson point process framework presented in Maroulas et al. (2020), which estimates the number of homological features by integrating the estimated posterior intensity. As discussed earlier, the Poisson PP framework approximates the cardinality as a Poisson distribution, and consequently this estimation produces higher variability as the number of points increases. However, the i.i.d. cluster PP characterization leads to accurate estimation of the cardinality with tighter variance (see Figure 5).
Table 2: List of parameters for (M2) and (M3). For Cases-1, 2, and 3, we consider point clouds sampled from the polar curve and perturbed by Gaussian noise having variances $0.001 I_2$, $0.005 I_2$, and $0.01 I_2$ respectively.

Case-2
We consider all of the priors as in Case-1. The point cloud used for this case (Figure 6 (a)) is more perturbed around the polar curve than in Case-1 (Gaussian noise with variance $0.005 I_2$). The associated PD, presented as black triangles overlaid on the posterior intensity plots, exhibits more spurious features. The parameters used for this case are listed in Table 2. First, we estimate the posterior intensity and cardinality for all six combinations using the same parameters as in Case-1, and the results are presented in Figure 6 (f)-(h) and (j)-(l). For the combinations (II, IC), (WII, IC), and (UI, IC) of priors, the posterior intensity and cardinality can accurately estimate the holes with different variance levels. As the WII is equipped with a higher variance than the II, the variance of the posterior cardinality obtained from (WII, IC) is slightly higher than that of (II, IC). However, due to the presence of several spurious features, the three combinations (II, UC), (WII, UC), and (UI, UC) slightly overestimate the cardinality. Next, to illustrate the effect of observed data on the posterior, we adjust two parameters: the variance of the likelihood, $\sigma^{\mathcal{D}_{Y_O}}$, and the decay parameter of the unexpected features, $\mu^{\mathcal{D}_{Y_U}}$. Recall that the intensity density of the PP $\mathcal{D}_{Y_U}$, consisting of the unexpected features in the observation, is exponential (Equation (3.6)), where $\mu^{\mathcal{D}_{Y_U}}$ controls the rate of decay away from the origin. We present the updated posteriors from the three combinations of priors (II, UC), (WII, UC), and (UI, UC). By decreasing the variance of the likelihood $\sigma^{\mathcal{D}_{Y_O}}$, the posterior intensities rely more on the observed features in the PD (see Figure 6 (n)-(p)). On the other hand, by adjusting the decay parameter, we enable our model to recognize the presence of several spurious features in the PD. This improves the estimation of the posterior cardinality, which is evident in Figure 6 (n)-(p).

Case-3
In this case we consider the point cloud (Figure 7 (a)), which is very noisy (Gaussian noise with variance 0.01I 2 ). Due to the noise level, we encounter only three points with medium prominent persistence, and there are many spurious features. All the priors are the same as in Case-1 and Case-2. The associated PD is presented as black triangles overlaid on the posterior intensity plots. The parameters used for this case are listed in Table 2. First, we estimate the posterior intensity and cardinality for all six combinations using the same parameters as in Case-1, and the results are presented in Figure 7

Classification of Actin Filament Networks
In this section, we classify 150 simulated actin filament networks in plant cells. Such filaments are key in the study of intracellular transportation in plant cells, as these bundles and networks make up the actin cytoskeleton, which determines the structure of the cell and enables cellular motion. In particular, three different classes of networks with different numbers of cross-linking proteins were considered. The number of cross-linking proteins in a network affects its shape. The networks were created using the AFINES (Active Filament Network Simulation) stochastic simulation framework introduced in Freedman et al. (2017, 2018), which models the assembly of the actin cytoskeleton. In the data set, the locations of the beads that represent segments of individual filaments are known. The value of each parameter in the simulation process is chosen to mimic real actin filaments. Hence, the Bayesian topological learning method presented here can be applied directly to real actin filament network data sets. Higher numbers of cross-linking proteins produce local geometric signatures (Tang et al. (2014)). However, the differences are not always notable due to the presence of noise in the data, which is a routine scenario for real experiments. To that end, we adopt a data-driven scheme for classification using an uninformative flat prior. We learn the networks in the training sets by means of their respective PDs, as they distill salient information about the network patterns with respect to connectedness and empty space (holes); that is, we can differentiate between filament networks by examining their homological features.
The three classes, generated with N = 825, 1650, and 3300 cross-linking proteins, are denoted as $C_1$, $C_2$, and $C_3$, respectively (see Figure 8 (a)-(c) for examples). From the viewpoint of topology, class $C_2$ and class $C_3$ contain more prominent holes than class $C_1$. Also, their respective PDs have different cardinalities. Hence, this topological aspect yields an important contrast between these three classes. To capture these differences we employ the following Bayes factor classification approach, relying on the closed form estimation of posterior distributions discussed in Section 3.1. A PD $D$ that needs to be classified is a sample from an i.i.d. cluster point process $\mathcal{D}$ with intensity $\lambda_{\mathcal{D}}$ and cardinality $\rho_{\mathcal{D}}$, and its probability density has the form $p_{\mathcal{D}}(D) = \rho_{\mathcal{D}}(|D|) \prod_{d \in D} \lambda_{\mathcal{D}}(d)$. For a training set $Q_{Y^k} := D_{Y^k_{1:n}}$ for $k = 1, \dots, K$ from $K$ classes of random diagrams $\mathcal{D}_{Y^k}$, we obtain the posterior intensities from the Bayesian framework using Proposition 3.1. The posterior probability density of $D$ given the training set $Q_{Y^k}$ is given by (4.1), and consequently the Bayes factor is obtained as the ratio of these posterior densities for a pair of classes. The final assignment of the class of $D$ is obtained by a majority voting scheme.
Figure 9: The intensity density for the unexpected feature PP used in classifying the filament networks.
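The classification rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the posterior cardinality and posterior intensity of each class are assumed to be available as log-scale callables (the hypothetical `log_rho` and `log_lam`), every pair of classes casts one vote, and a Bayes factor threshold of 1 (0 on the log scale) is assumed. The binomial and Gaussian toy models stand in for the closed-form posteriors.

```python
import math
from itertools import combinations


def log_density(diagram, log_rho, log_lam):
    """Log of p(D) = rho(|D|) * prod_{d in D} lam(d) for one class."""
    return log_rho(len(diagram)) + sum(log_lam(d) for d in diagram)


def classify(diagram, class_models):
    """class_models maps label -> (log_rho, log_lam); returns the winning label.

    Each pair (i, j) votes for i when log BF_ij = log p_i(D) - log p_j(D) > 0
    (assumed threshold), and for j otherwise.
    """
    scores = {k: log_density(diagram, *m) for k, m in class_models.items()}
    votes = {k: 0 for k in class_models}
    for i, j in combinations(class_models, 2):
        votes[i if scores[i] - scores[j] > 0 else j] += 1
    return max(votes, key=votes.get)


def binomial_log_pmf(n_max, p):
    """Toy cardinality model (hypothetical parameters)."""
    def f(n):
        if n > n_max:
            return -math.inf
        return (math.log(math.comb(n_max, n))
                + n * math.log(p) + (n_max - n) * math.log(1 - p))
    return f


def gaussian_log_intensity(mean, var):
    """Toy spatial model: a single isotropic Gaussian (hypothetical parameters)."""
    def f(d):
        sq = (d[0] - mean[0]) ** 2 + (d[1] - mean[1]) ** 2
        return -sq / (2 * var) - math.log(2 * math.pi * var)
    return f


models = {
    "C1": (binomial_log_pmf(10, 0.2), gaussian_log_intensity((0.2, 0.1), 0.01)),
    "C2": (binomial_log_pmf(10, 0.5), gaussian_log_intensity((0.5, 0.4), 0.01)),
    "C3": (binomial_log_pmf(10, 0.8), gaussian_log_intensity((0.8, 0.7), 0.01)),
}
diagram = [(0.5, 0.41), (0.48, 0.38), (0.52, 0.4), (0.51, 0.42), (0.49, 0.39)]
predicted = classify(diagram, models)
```

Working on the log scale avoids numerical underflow when a diagram contains many points, since the product of intensities becomes a sum.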
PDs with 1-dimensional features (see Figure 8 (d)-(f) for an example of each class) were created for each actin network through the Rips filtration discussed in Section 2.1, and were then used as input for the Bayes factor classification scheme of (4.1). The number of 1-dimensional features in the dataset is large, and posterior estimation on the full dataset is computationally infeasible. To mitigate this issue, we subsample the dataset to reduce its size. Precisely, our subsampled dataset consists of 25 points from each of the PDs obtained from the 150 simulated filament networks. We found that taking more than 25 points from each of the PDs did not improve the classification, and typically led to a very expensive computational scheme. Table 3 summarizes the choices of parameters for the model. The intensity density of $\mathcal{D}_{Y_U}$ used for the classification is presented in Figure 9.
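The subsampling step could look like the sketch below. The text does not state how the 25 points are chosen, so uniform sampling without replacement is an assumption here, as are the function name `subsample_diagram` and the fixed seed.

```python
import random


def subsample_diagram(diagram, size=25, seed=0):
    """Return at most `size` points drawn uniformly without replacement."""
    points = list(diagram)
    if len(points) <= size:
        return points
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(points, size)


# Toy diagram of 200 (birth, persistence) pairs, for illustration only.
toy_diagram = [(0.01 * i, 0.001 * i) for i in range(200)]
subsample = subsample_diagram(toy_diagram)
```

Diagrams that already contain 25 points or fewer are returned unchanged.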
One intuitive interpretation of the unexpected features is that they represent noise in the dataset; consequently, they often have very short persistence. On the other hand, the dataset of filament networks routinely contains several incomplete loops, which implies that points with late birth and short persistence are expected from the underlying topology. Since we use 10-fold cross validation to estimate the model's accuracy, the posterior is calculated using the training set for each fold and each class. Then, for each instance, we assign the class using the majority voting scheme. We compute the resulting areas under the receiver operating characteristic (ROC) curves (AUCs), and the results are listed in Table 4. The AUC across the 10 folds was 0.925. Further details on the classification problem are given in the supplementary materials.
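The 10-fold protocol can be sketched as below. The fold assignment is a generic stratified split, and the balanced class sizes (50 networks per class) are an assumption made only for this illustration; the paper's exact fold construction is not given in this section.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each index to a fold so every class is spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)
    return folds


labels = ["C1"] * 50 + ["C2"] * 50 + ["C3"] * 50  # assumed balanced classes
folds = stratified_folds(labels)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Fit the per-class posteriors on train_idx, then classify test_idx.
```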

Comparison with Other Methods
We benchmarked our method against several other machine learning algorithms. We mainly pursued two avenues: (i) features selected using TDA methodology, and (ii) features selected using non-TDA methodology. The TDA methodologies are persistence images (PIs) (Adams et al., 2017), persistence landscapes (PLs) (Bubenik, 2015), and Euler characteristic curves (ECCs) (Richardson and Werman, 2014). These summaries have been widely implemented as they are amenable to existing machine learning methodologies. The common theme of these summaries is to extract a pertinent feature vector and then train a classifier on it using machine learning algorithms. Here we input these topological summaries as features for three different optimized classification algorithms: random forest (RF), support vector machine (SVM), and neural network (NN).
We considered a vector of 2500 values at which the PLs of order 1, 2, and 3 are evaluated, and found the third order PL to be the most efficient summary for this classification task. In order to compute the PIs, we discretize the domain space into a 50 × 50 grid with a spread of 0.1. The linear ramp function is used to produce weights for computing PIs. We explored the classification problem using PIs with and without the linear weights and found that the PIs without any weights provide better accuracy than those with weights. This is plausible because the linear ramp function assigns more weight to higher-persistence points, which renders the local features insignificant. The other topological summaries we implemented are ECCs from 0- and 1-dimensional persistence diagrams, where the filtrations were constructed from the range of birth and persistence values. We optimally tune the parameters of the SVM using a grid search. Precisely, the parameter γ of the radial basis kernel, that is, the inverse of the standard deviation of the kernel, was optimally selected from the range 0.1 to 1 in increments of 0.1. In order to choose the optimal parameters for the NN, we performed an extensive grid search over all parameters. However, we found that the only two parameters that can potentially improve the classification accuracy are the number of hidden layers and the maximum number of iterations. The optimal performance was achieved for PLs with 20 layers and a maximum of 10 iterations, for PIs with 3 layers and a maximum of 200 iterations, and for ECCs with 20 layers and a maximum of 10 iterations. For the RF algorithm we employ 500 trees.
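The persistence image construction described above can be illustrated with the sketch below. It makes several simplifying assumptions that are not taken from the paper: the "spread" is treated as the standard deviation of an isotropic Gaussian, each pixel is evaluated at its center rather than integrated over, the domain is the unit square, and the linear ramp is normalized by the largest persistence in the diagram.

```python
import math


def persistence_image(diagram, grid=50, spread=0.1, lo=0.0, hi=1.0, weighted=False):
    """diagram: list of (birth, persistence) pairs.

    Returns a grid x grid nested list; rows index persistence, columns index birth.
    """
    step = (hi - lo) / grid
    centers = [lo + (i + 0.5) * step for i in range(grid)]
    max_pers = max((p for _, p in diagram), default=1.0) or 1.0
    norm = 1.0 / (2.0 * math.pi * spread ** 2)
    image = [[0.0] * grid for _ in range(grid)]
    for b, p in diagram:
        w = p / max_pers if weighted else 1.0  # linear ramp weight (assumed form)
        for r, y in enumerate(centers):
            for c, x in enumerate(centers):
                sq = (x - b) ** 2 + (y - p) ** 2
                image[r][c] += w * norm * math.exp(-sq / (2.0 * spread ** 2))
    return image


toy_diagram = [(0.5, 0.5)]
image = persistence_image(toy_diagram)
features = [v for row in image for v in row]  # flattened 2500-dimensional vector
```

Setting `weighted=False` corresponds to the unweighted variant that the text reports as more accurate for this task.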
Additionally, we compare our method with machine learning algorithms where the features are selected using a non-TDA method. As the filament networks possess a very definite spatial structure, we found raster images (Hijmans, 2019) to be the most useful method for extracting features. In particular, a raster image represents data by using a grid with a value assigned to each pixel. The assigned value can reflect a wide variety of information. In our analysis, we discretize the domain of a filament network into 2500 grid cells arranged in 50 rows and 50 columns, and then count the number of points in each grid cell. This approach not only converts each filament network into a raster image, which in turn is used as input to machine learning algorithms, but also captures definite spatial structures, such as the presence of empty space and connectedness, in a very efficient manner. We present an example in Figure 10. The parameters for the machine learning algorithms are tuned in a similar fashion, i.e., using a grid search. The optimal performance for the NN was achieved with 5 layers and a maximum of 200 iterations. The results of this comparison are in Table 4, which shows that our method outperforms the other methods.
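The raster feature extraction amounts to a two-dimensional histogram of bead locations. The sketch below is an illustration in plain Python rather than the raster package cited in the text; the function name `rasterize` and the `bounds` argument are hypothetical.

```python
def rasterize(points, n_rows=50, n_cols=50, bounds=None):
    """Count points per grid cell; returns a flat, row-major list of counts."""
    if bounds is None:
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        bounds = (min(xs), max(xs), min(ys), max(ys))
    x0, x1, y0, y1 = bounds
    width = (x1 - x0) or 1.0   # guard against a degenerate domain
    height = (y1 - y0) or 1.0
    counts = [0] * (n_rows * n_cols)
    for x, y in points:
        # Clamp so points on or beyond the boundary fall in the edge cells.
        c = min(max(int((x - x0) / width * n_cols), 0), n_cols - 1)
        r = min(max(int((y - y0) / height * n_rows), 0), n_rows - 1)
        counts[r * n_cols + c] += 1
    return counts


toy_points = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.5, 0.5)]
counts = rasterize(toy_points, bounds=(0.0, 1.0, 0.0, 1.0))
```

The resulting vector of 2500 counts can be passed to any of the classifiers above in place of a topological summary.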

Discussion
This paper has proposed a generalized Bayesian framework for PDs by modeling them as i.i.d. cluster point processes. Our framework provides a probabilistic descriptor of the diagrams by simultaneously estimating the cardinality and spatial distributions of points on a PD. In this work we focus on developing a Bayesian model for PDs only. A fusion of different topological summaries engaged within a Bayesian perspective is a worthwhile future direction.
It is noteworthy that our Bayesian model directly employs PDs, which are topological summaries of data, for defining a substitution likelihood rather than using the entire point cloud. This deviates from a strict Bayesian model, as we consider the statistics of PDs rather than the underlying datasets used to create them; however, our paradigm incorporates prior knowledge and observed data summaries to create posterior distributions, analogous to the notion of substitution likelihood in Jeffreys (1961). The general relationship between the likelihood models related to point cloud data and those of their corresponding persistence diagrams remains an important open problem. We demonstrate that a valid update of the prior distribution on persistence diagrams to the posterior can be made by substitution of the likelihood through a topological summary of the data rather than a traditional likelihood function. Indeed, the idea of utilizing topological summaries of point clouds in place of the actual point clouds proves to be a powerful tool with applications in wide-ranging fields. This process incorporates topological descriptors of point clouds, which simultaneously decipher essential shape peculiarities and avoid unnecessarily complex geometric features.
We derive closed forms of the posterior for realistic implementation, using Gaussian mixtures for the prior intensity and binomials for the prior cardinality. A detailed example showcases the posterior intensities and cardinalities for various interesting instances created by varying parameters within the model. This example exhibits our method's ability to recover the underlying PD. Thus, the Bayesian inference developed here opens up new avenues for machine learning algorithms and data analysis techniques to be applied directly to the space of PDs. Indeed, we derive a classification algorithm, successfully apply it to simulated filament network data, and show that it compares favorably with other TDA and machine learning approaches.