Multiclass classification for multidimensional functional data through deep neural networks

The intrinsically infinite-dimensional nature of functional observations over multidimensional domains renders standard classification methods effectively inapplicable. To address this problem, we introduce a novel multiclass functional deep neural network (mfDNN) classifier as an innovative data mining and classification tool. Specifically, we consider a sparse deep neural network architecture with the rectified linear unit (ReLU) activation function and minimize the cross-entropy loss in the multiclass classification setup. This neural network architecture allows us to employ modern computational tools in the implementation. The convergence rates of the misclassification risk functions are also derived for both fully observed and discretely observed multidimensional functional data. We demonstrate the performance of mfDNN on simulated data and several benchmark datasets from different application domains.


Introduction
Functional data classification has wide applications in many areas, such as machine learning and artificial intelligence [19,10,17,5]. The majority of the work on functional data classification focuses on one-dimensional functional data, for example, temperature data over a certain period [16] and speech recognition (log-periodogram) data [11,23], while few results are available for multidimensional functional data classification, for example, 2D or 3D image classification. To the best of our knowledge, there is only one recent work [22] in which binary classification for general non-Gaussian multidimensional functional data was considered. However, the classifier proposed there is designed for fully observed functional data and can only perform binary classification. There is also a lack of literature on multiclass classification for functional data. [11] investigated a Bayesian approach to estimate the parameters in multiclass functional models. However, their method is still only suitable for fully observed one-dimensional functional data.
The availability of massive databases has resulted in the development of scalable machine learning methods to process such data. In this paper, we are interested in multiclass classification for multidimensional functional data, including but not limited to 2D and 3D imaging data, in the framework of DNNs. Existing statistical and machine learning methods belong to the general framework of multivariate analysis: data are treated as vectors of discrete samples, and permuting the components does not affect the analysis, hence the ordering of pixels or voxels is irrelevant and the imaging data are treated as high-dimensional data. In recent years, many sparse discriminant analysis methods have been proposed for i.i.d. high-dimensional data classification and variable selection. Most of these proposals focus on binary classification and are not directly applicable to multiclass problems. A popular multiclass sparse discriminant analysis proposal is the ℓ1-penalized Fisher's discriminant [24]. However, [24] lacks theoretical justification: it is generally unknown how close the estimated discriminant directions are to the true directions, and whether the final classifier performs similarly to the Bayes rule. More recently, [14] proposed a multiclass sparse discriminant analysis method that estimates all discriminant directions simultaneously, with theoretical justification in the high-dimensional setting. We carefully compare the numerical performance of our method with both [24] and [14] in Sections 5 and 6.
An efficient way to handle multidimensional functional data, for example imaging data, is to incorporate the information inherent in the ordering and smoothness of the processes over pixels or voxels. In this paper, we propose a novel approach, called multiclass functional deep neural network (mfDNN, multiclass-functional + DNN), for multiclass classification of multidimensional functional data. We extract the functional principal component scores (FPCs) of the data functions, and then train a DNN-based classifier on these FPCs and their corresponding class memberships. Given the FPCs as inputs and the class labels as outputs, we consider a sparse deep ReLU network architecture and minimize the cross-entropy (CE) loss in the multiclass classification setup. CE loss is a standard tool in machine learning classification, but it has not been utilized to its fullest potential in the context of functional data classification, especially regarding the theoretical properties of DNN classifiers. A recent work [2] contains an example of multiclass classification by minimizing CE loss in a sparse DNN in the cross-sectional data setting. While the powerful capabilities of DNNs and CE loss have been clearly demonstrated for i.i.d. data, how best to adapt these models to functional data remains under-explored. We apply the same loss function as [2] but develop it in the context of multidimensional functional multiclass classification. Furthermore, convergence rates for the conditional class probabilities are derived under both fully observed and discretely observed multidimensional functional data settings.
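The two-step pipeline above (extract FPC scores, then train a classifier on them) can be sketched in a few lines. The following is a minimal illustration in plain numpy, not the authors' implementation: function names are ours, images are flattened to rows on a common observation grid, and scores are estimated by eigendecomposing the sample covariance.

```python
import numpy as np

def fpc_scores(X, J):
    """Estimate the leading J functional principal component scores.

    X : (n, m) array of n functions observed on a common grid of m points
        (a 2D/3D image is flattened into one row, preserving pixel order).
    Returns an (n, J) score matrix -- the inputs to a downstream classifier.
    """
    n, m = X.shape
    mu = X.mean(axis=0)                 # pooled mean function on the grid
    Xc = X - mu                         # centered functions
    cov = Xc.T @ Xc / n                 # (m, m) sample covariance surface
    vals, vecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:J]  # indices of the leading J components
    psi = vecs[:, order]                # discretized eigenfunctions
    return Xc @ psi                     # projection scores xi_{ij}

# toy usage: two classes differing by a mean shift
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 30)), rng.normal(0.5, 1, (50, 30))])
scores = fpc_scores(X, J=5)
print(scores.shape)                     # (100, 5)
```

The score matrix (with class labels) would then be fed to the DNN classifier described in Section 2.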
The rest of this article is organized as follows. In Section 2, we introduce the K-class functional data classification models and propose the multiclass functional deep neural network classifier. In Section 3, we establish theoretical properties of mfDNN under suitable technical assumptions. Section 4 provides two progressive examples to demonstrate the validity of these technical assumptions. In Section 5, the performance of mfDNN and its competitors is demonstrated through simulation studies. The corresponding R programs implementing our method are provided on GitHub. In Section 6, we apply mfDNN to the zip code dataset and Alzheimer's Disease data. Section 7 summarizes the conclusions. Technical proofs are provided in the Appendix.

Suppose we observe i.i.d. training samples {(X_i(s), Y_i)}_{i=1}^n, where the X_i(s) are random processes over a d-dimensional domain, independent of the new process X(s) to be classified, and Y_i = (Y_i1, ..., Y_iK)^⊺ is the class label, such that Y_ik = 1 if the i-th sample is in the k-th group, k = 1, 2, ..., K, and 0 otherwise. Throughout the paper, we regard the number of classes K as a universal constant not depending on the sample size n. In multiclass classification with K ≥ 2 classes, we are interested in assigning a newly observed continuous curve X(s) to one of the K classes given the training data.
If Y_ik = 1, X_i(s) presumably has some unknown mean function EX_i(s) = μ_k(s) and an unknown covariance function, and admits the Karhunen-Loève decomposition X_i(s) = μ_k(s) + Σ_{j≥1} ξ_j ψ_j(s), where {ψ_j}_{j≥1} is an orthonormal basis with respect to the L² inner product and the ξ_j's are pairwise uncorrelated random coefficients. By projecting the function space to a vector space, we define the conditional probabilities π_k(ξ) = P(Y = e_k | ξ), where e_k = (0, ..., 0, 1, 0, ..., 0)^⊺ is the K-dimensional standard basis vector indicating that the k-th class is observed and ξ = (ξ_1, ξ_2, ...)^⊺ is an infinite-dimensional random vector.
When the X_i's are random processes with complex structures, such as non-Gaussian processes, one major challenge is the complicated underlying form of {h_k}_{k=1}^K, so that estimation of {π_k}_{k=1}^K is typically difficult. In this section, inspired by the rich approximation power of DNNs, we propose a new classifier, mfDNN, which can closely approach the Bayes classifier even when the h_k's are non-Gaussian and complicated.
Due to the large capacity of the fully connected DNN class, and to avoid overparameterization, we consider a sparse DNN class F(L, J, p, s), where ∥·∥_∞ denotes the maximum-entry norm of a matrix/vector, which is bounded by 1, ∥·∥_0 denotes the number of nonzero elements of a matrix/vector, and s controls the total number of active neurons. Given the training data (ξ_1^J, Y_1), ..., (ξ_n^J, Y_n), we minimize the empirical CE risk, where ϕ(x_1, x_2) = −x_1^⊺ log(x_2) denotes the CE loss function. We then propose the mfDNN classifier, which assigns X(s) to the class maximizing the estimated conditional probability. The detailed implementation of the proposed mfDNN classifier is given in Algorithm 1. The tuning parameters include J, L, p, and s, which theoretically determine the performance of classification; we discuss the related theory in Section 3. In practice, we suggest the following data-splitting approach to select (J, L, p, s). Note that the sparsity level cannot be imposed directly by fixing the number of active neurons; instead, we use the dropout technique to obtain a sparse network.
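The ingredients just described can be sketched in plain numpy (a hedged illustration, with our own function names and toy network shapes, not the Algorithm 1 implementation): a ReLU forward pass producing softmax class probabilities, the empirical CE loss averaged over the sample, and the argmax decision rule.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(xi, weights, biases):
    """ReLU network mapping J projection scores to K class probabilities."""
    h = xi
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

def cross_entropy(Y, P):
    """Empirical CE risk: average of -Y^T log(P) over the sample."""
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def classify(xi, weights, biases):
    """mfDNN-style decision rule: pick the class with largest estimated probability."""
    return np.argmax(forward(xi, weights, biases), axis=1)

# toy usage: J = 5 scores, K = 3 classes, one hidden layer of 16 neurons
rng = np.random.default_rng(1)
J, K, n = 5, 3, 4
weights = [rng.normal(0, 0.1, (J, 16)), rng.normal(0, 0.1, (16, K))]
biases = [np.zeros(16), np.zeros(K)]
xi = rng.normal(size=(n, J))
Y = np.eye(K)[rng.integers(0, K, n)]        # one-hot class labels
P = forward(xi, weights, biases)
print(cross_entropy(Y, P))                  # finite and nonnegative
print(classify(xi, weights, biases))        # predicted class indices
```

In training, the weights would be fitted by stochastic gradient descent on the CE risk, with dropout applied to the hidden layers to induce sparsity.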

Assumptions for functional multi-classification setup
To formulate our classification task, we first introduce two mild assumptions on π_k(·).
Unless otherwise stated, c, C, C_1, C_2, ... represent positive constants that may vary from line to line.
Assumption 2 provides a uniform lower bound for π_k, indicating that π_k is bounded away from zero by ϵ in probability.
Remark 1. Assumptions 1 and 2 both characterize the behaviour of π_k near zero, from different angles. Specifically, Assumption 1 controls the decay rate of the probability measure of {ξ : 0 ≤ π_k(ξ) ≤ x}, while Assumption 2 truncates the probability measure below ϵ to zero. The two assumptions are closely related in certain circumstances. For example, given some ϵ > 0, Assumption 2 implies Assumption 1 for arbitrary α and C = max_k ϵ^{−α_k}, and Assumption 1 implies Assumption 2 in the trivial case C = 0. However, they are not equivalent in most scenarios, where C, {α_k}_{k=1}^K, and ϵ are all universally positive constants.
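Written out, the first implication in Remark 1 is a one-line bound (we read the event as excluding {π_k(ξ) = 0}, so that Assumption 2 forces its probability to vanish below ϵ):

```latex
% For 0 < x < \epsilon, Assumption 2 gives P\{0 < \pi_k(\xi) \le x\} = 0;
% for x \ge \epsilon, bound the probability by 1 and use 1 \le (x/\epsilon)^{\alpha_k}:
P\{0 < \pi_k(\xi) \le x\}
  \le \mathbf{1}\{x \ge \epsilon\}
  \le \epsilon^{-\alpha_k} x^{\alpha_k}
  \le C\, x^{\alpha_k},
\qquad C = \max_k \epsilon^{-\alpha_k},
```

so Assumption 1 holds for arbitrary {α_k}.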
Essentially, functional data are intrinsically infinite-dimensional. Owing to this unique phenomenon, we introduce the following finite-approximation condition for π_k. For any positive integer J, define ξ^J = (ξ_1, ..., ξ_J)^⊺. Assumption 3 provides a uniform upper bound on the probability that π_k differs from π_k^{(J)} by at least x, which approaches zero as either J or x tends to infinity, implying that, once the dimension is reduced, there exists some large enough J such that π_k^{(J)} accurately approximates π_k. All three assumptions are verified in concrete examples in Section 4.

Conditional probability π k with complex structure
In the following, we impose composition structures on π_k(·). For t ≥ 1, a measurable subset D ⊂ R^t, and constants β, R > 0, define the class of β-Hölder smooth functions on D with radius R. For q ∈ N_+, let G(q, J, d, t, β) be the class of functions g admitting a modular expression g = g_q ∘ ... ∘ g_0, where g_u = (g_u1, ..., g_u d_{u+1})^⊺ : R^{d_u} → R^{d_{u+1}} and each g_uv : R^{t_u} → R is locally β_u-Hölder smooth. The d_u arguments of g_u are locally connected in the sense that each component g_uv relies on only t_u (≤ d_u) of the arguments. Similar structures have been considered by [18,1,13,21,2,9,8] in multivariate regression or classification to overcome high dimensionality. The generalized additive model [7] and the tensor product space ANOVA model [12] are special cases; see [13].
We define the corresponding class H of densities h = (h_1, ..., h_K) accordingly. Note that this density class H includes many popular models studied in the literature, both Gaussian and non-Gaussian; see Section 4.
Throughout the paper, we allow π_k to lie in some complicated class G with group-specific parameters q^(k), d^(k), t^(k) and β^(k). The selection range of the truncation parameter J is based on the asymptotic orders provided in Assumptions 4 and 5 in the next section. Although π_k is infinite-dimensional, it depends on only a small number of effective arguments, implying that the population densities differ in a small number of variables. Such conditions are necessary for high-dimensional classification. For instance, in high-dimensional Gaussian data classification, [4,3] show that, to consistently estimate the Bayes classifier, it is necessary that the mean vectors differ in a small number of components. The modular structure holds for arbitrary J, and may be viewed as an extension of [18] to the functional data analysis setting.

Kullback-Leibler divergence
For the true probability distribution π(x) = (π_1(x), ..., π_K(x))^⊺ and any generic estimator π̂(x) = (π̂_1(x), ..., π̂_K(x))^⊺, define the corresponding discrete Kullback-Leibler divergence. For any π̂ trained with {(X_i(s), Y_i)}_{i=1}^n, we evaluate its performance by the log-likelihood ratio risk. Note that this estimation risk is associated with the well-known CE loss in Section 2.2. Unlike the popular least-squares loss, the CE loss is the expectation, with respect to the input distribution, of the Kullback-Leibler divergence of the conditional class probabilities. If any one of the conditional class probabilities is estimated as zero while the underlying conditional class probability is positive, the risk can become infinite. To avoid an infinite risk, we truncate the CE loss function and derive convergence rates without assuming that either the true conditional class probabilities or their estimators are bounded away from zero. Instead, our misclassification risks depend on an index quantifying the behaviour of the conditional class probabilities near zero.
Given an absolute constant C_0 ≥ 2, for any classifier π̂, define the truncated Kullback-Leibler risk for a density h accordingly. Remark 2. C_0 is introduced to avoid an infinite value of the ordinary Kullback-Leibler risk. It is easy to show that C_0 can be dropped only when π̂_k is bounded below by exp(−C_0) for all k. When π̂_k deviates substantially from π_k for some k, the ordinary Kullback-Leibler risk explodes to infinity, which makes it infeasible to evaluate the performance of π̂.
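A small numerical illustration of why truncation is needed (a sketch with our own function names; capping the log-ratio at C_0 mirrors the role of the constant above): when an estimated class probability is near zero while the true one is positive, the ordinary KL risk blows up, whereas the truncated version stays bounded by C_0.

```python
import numpy as np

def kl(pi, pi_hat):
    """Discrete KL divergence: sum_k pi_k log(pi_k / pi_hat_k)."""
    pi, pi_hat = np.asarray(pi, float), np.asarray(pi_hat, float)
    mask = pi > 0                       # terms with pi_k = 0 contribute zero
    return np.sum(pi[mask] * np.log(pi[mask] / pi_hat[mask]))

def truncated_kl(pi, pi_hat, C0=2.0):
    """KL with each log-ratio capped at C0, keeping the risk finite."""
    pi, pi_hat = np.asarray(pi, float), np.asarray(pi_hat, float)
    mask = pi > 0
    logratio = np.minimum(np.log(pi[mask] / pi_hat[mask]), C0)
    return np.sum(pi[mask] * logratio)

pi = np.array([0.6, 0.3, 0.1])                 # true class probabilities
pi_hat = np.array([0.999999, 1e-9, 1e-9])      # near-zero estimates for two classes
print(kl(pi, pi_hat))            # large: explodes as the estimates -> 0
print(truncated_kl(pi, pi_hat))  # bounded: at most C0 by construction
```

Since the probabilities π_k sum to one, the truncated risk is bounded above by C_0 regardless of how poor the estimator is.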

Convergence rate for fully observed functional data
In this section, we provide the non-asymptotic Kullback-Leibler risk of the mfDNN classifier. Let θ = (1 + α)β / {(1 + α)β + t}, where α = min_k α_k ∧ 1.
Assumption 4. There exist some constants C_1, C'_2, C_2, C_3, depending only on H and C_0, such that the DNN class F(L, J, p, s) satisfies the required order conditions. Assumption 4 provides exact orders on (L, p, s) for the network, and Assumption 4(b) provides the precise range of J. It is worth mentioning that this condition implies ϱ ≥ θν^{−1}, i.e., the function ζ(J) converges to zero at a relatively fast rate as J → ∞.
In the following, we provide the convergence rate in the ideal case where the entire functional curve is fully observed. Theorem 3.1. There exists a positive constant ω_1, depending only on H and C_0, such that the stated risk bound holds, where the network classifier π̂ belongs to F(L, J, p, s) in Assumption 4.
Remark 3. Theorem 3.1 provides an upper bound for the KL misclassification risk of the proposed mfDNN. When K = 2, multiclass classification reduces to the binary classification problem. Compared with the minimax excess misclassification risk derived in [22] for binary classification, the upper bound rate in Theorem 3.1 is of slightly larger order. Specifically, the leading term of the upper bound rate in Theorem 3.1 is n^{−θ} with θ = (1 + α)β / {(1 + α)β + t}, while the leading term in Theorem 1 of [22] is n^{−S_0} with S_0 = (α + 1)β / {(α + 2)β + t}. Hence, when the β's, i.e., the degrees of Hölder smoothness of the functions in the class H, are large enough, the discrepancy between these two bounds is hardly negligible. Therefore, to avoid a potentially slightly larger risk, we recommend [22] for binary classification problems when the distribution functions are not smooth enough, and we recommend the mfDNN classifier for multiclass problems regardless of the smoothness of the conditional distribution functions.

Convergence rate for discretely observed functional data
In practice, it is usually unrealistic to observe the full trajectory of each subject, so the rate in Theorem 3.1 can be attained only if the sampling frequency is dense enough. Hence, it is of interest to study the upper bound of the risk of the mfDNN classifier when functional data are discretely observed at m locations for each subject. Let β = max_{k=1,...,K} β^{(k)}, where β^{(k)} is defined in (3.5) and τ is a positive universal constant.
Assumption 5. There exist some constants C_1, C_2, and C_3, depending only on H, C_0 and τ, and a phase-transition point m* ∈ N_+, such that the DNN class F(L, J, p, s) satisfies the stated order conditions. Similar to Assumption 4, Assumption 5 provides exact orders on L, p, s, and the range of J when the sampling frequency m is involved. When m ≥ m*, Assumption 5 coincides with Assumption 4 for dense functional data.
The following theorem provides the phase-transition rate when functional data are discretely observed at m locations, at a certain rate with respect to τ. Theorem 3.2. When E|ξ̂_j − ξ_j| ≲ m^{−τ} for all j = 1, ..., J, there exist positive constants ω_1, ω_2 and ω_3, depending only on H, C_0 and τ, such that the stated bound holds, where π̂ ∈ F(L, J, p, s) is defined in Assumption 5.

Examples
In this section, we provide two examples based on exponential families to verify our model assumptions, and to emphasize the necessity of the DNN approach owing to the complicated structure of the data population. For simplicity, we assume throughout this section that the prior probabilities satisfy P(Y = e_k) = K^{−1} for all k.

Independent exponential family
We first consider independent projection scores drawn from exponential families. For some collection of unknown parameters {θ_kj}_{k=1,j=1}^{K,∞} and unknown collections of functions {U_kj}_{k=1,j=1}^{K,∞} and {W_kj}_{k=1,j=1}^{K,∞}, we consider the k-th class conditional density, with A_kk' and B_kk' the two sets identifying the difference between h_k and h_k'. Therefore, we have the pairwise log-likelihood ratio for all J ≥ J_max. By definition, Assumption 1 holds for α = 1. Assumption 2 holds when h_k/h_k' is bounded for all pairs; this is trivially the case when the {h_k}_{k=1}^K share the same {U_kj}_{k=1,j=1}^{K,∞} and {W_kj}_{k=1,j=1}^{K,∞}, as for the Gaussian, Student's t, and exponential distributions, whose within-family density ratios are always bounded. Assumption 3 holds for J_0 = J_max and for an arbitrary function e(·) with exponential tails. The smoothness is determined by {U_kj}_{k=1,j=1}^{K,∞} and {W_kj}_{k=1,j=1}^{K,∞}, so h is trivially in some H.

Exponential family with in-block interaction
In this example, we consider ξ_j that are dependent within a block but independent across blocks, extending the example in Section 4.1. Given a sequence of positive integers {ℓ_p}_{p=1}^∞ with 0 = ℓ_1 < ℓ_2 < ..., we define the p-th group index set E_p = {ℓ_p + 1, ..., ℓ_{p+1}}, with cardinality |E_p| = ℓ_{p+1} − ℓ_p, where the grouping is based on adjacent indices for simplicity. We consider a collection of unknown parameters {θ_kp}_{k=1,p=1}^{K,∞} and unknown collections of functions {U_kp} and {W_kp}, where U_kp and W_kp are functions from R^{|E_p|} to R.
For any 1 ≤ k, k' ≤ K, define the density difference sets A_kk' = {p : U_kp ≠ U_k'p} and B_kk' = {p : W_kp ≠ W_k'p}; the pairwise log-likelihood ratio is then given accordingly. Given some finite positive number N_kk' such that |A_kk' ∪ B_kk'| ≤ N_kk', the verification can be derived similarly to Section 4.1.

Simulation studies
In this section, we provide numerical evidence of the superior performance of mfDNN. In all simulations, we generated n_k = 200, 350, 700 training samples for each class, with corresponding testing sample sizes 100, 150, 300. Since no existing multiclass classification method is specifically designed for multidimensional functional data, for comparison we include the multiclass sparse discriminant analysis (MSDA) approach introduced in [14] and the ℓ1-penalized Fisher's discriminant, i.e., the penalized linear discriminant analysis (PLDA) classifier, of [24]. MSDA and PLDA are efficient classifiers designed for high-dimensional i.i.d. observations. To make these two methods directly applicable to functional data, we first pre-processed the 2D or 3D functional data by vectorization. The implementation uses the R packages msda and PenalizedLDA, where the tuning parameter candidates for the penalty term are generated by default. We use the default five-fold and six-fold cross-validation to tune MSDA and PLDA, respectively. For mfDNN, we use a tensor of Fourier bases to extract projection scores by integration. The structural parameters (L, J, p, s) are selected by Algorithm 1, with candidates chosen based on Theorem 3.2. R code and examples for the proposed mfDNN algorithm are available on GitHub (https://github.com/FDASTATAUBURN/mfdnn).
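The score-extraction step just described can be sketched as follows. This is a hedged numpy illustration rather than the authors' R implementation: it assumes observations on an equispaced grid over [0, 1]², approximates the integrals by Riemann sums, and the function names are ours.

```python
import numpy as np

def fourier_basis(t, j):
    """1D Fourier basis on [0, 1]: 1, sqrt(2)cos(2*pi*t), sqrt(2)sin(2*pi*t), ..."""
    if j == 0:
        return np.ones_like(t)
    k = (j + 1) // 2
    f = np.cos if j % 2 == 1 else np.sin
    return np.sqrt(2.0) * f(2.0 * np.pi * k * t)

def tensor_scores_2d(X, J1, J2):
    """Projection scores of 2D functions on a tensor Fourier basis.

    X : (n, m1, m2) array of surfaces observed on an equispaced grid
        over [0, 1]^2; integrals are approximated by Riemann sums.
    Returns an (n, J1 * J2) score matrix.
    """
    n, m1, m2 = X.shape
    t1 = (np.arange(m1) + 0.5) / m1     # midpoints of the grid cells
    t2 = (np.arange(m2) + 0.5) / m2
    B1 = np.stack([fourier_basis(t1, j) for j in range(J1)])  # (J1, m1)
    B2 = np.stack([fourier_basis(t2, j) for j in range(J2)])  # (J2, m2)
    # score_{i,(j1,j2)} = integral of X_i(s1, s2) b_{j1}(s1) b_{j2}(s2) ds1 ds2
    S = np.einsum('nab,ja,kb->njk', X, B1, B2) / (m1 * m2)
    return S.reshape(n, J1 * J2)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 20, 20))       # 10 surfaces on a 20 x 20 grid
scores = tensor_scores_2d(X, J1=3, J2=3)
print(scores.shape)                     # (10, 9)
```

The 3D case is analogous, with a third basis factor and a triple Riemann sum.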

2D functional data
For k = 1, 2, 3, we generated functional data from the models specified below. Define 1_k to be a k × 1 vector of ones; we specify the distributions of the scores ξ accordingly. The resulting sampling frequencies are m = 9, 25, 100, 400, so the functional observations range from sparse to dense. Tables 1 and 2 summarize the results of 100 simulation replicates. For mfDNN, the misclassification risk decreases as the sample size n increases and as the sampling frequency m increases; this finding further confirms Theorem 3.2. Under the relatively sparse sampling frequency m = 9, MSDA performs slightly better than mfDNN. However, despite the increase of m in Table 2, neither MSDA nor PLDA improves in terms of misclassification risk, indicating that these classifiers cannot be improved with the additional gathered information. In summary, the simulation results illustrate that the proposed mfDNN method outperforms the existing sparse and penalized discriminant analyses when classifying dense 2D functional data.
Model 5 (3D Gaussian): Let ξ be specified with (r_11, ..., r_19)^⊺ = 0.1 × (1, 3, 5, 7, 9, 11, 13, 15, 17)^⊺, (ν_21, ..., ν_29)^⊺ = 0.6 × 1_9, μ_3 = 0_9, and Σ_3^{1/2} = diag(4.5, 4, 3.5, 3, 2.5, 2, 1.5, 1, 0.5). For the 3D functional data, we apply setups similar to the 2D cases. We observe the functional data on 2 × 2 × 2, 3 × 3 × 3, 4 × 4 × 4, and 5 × 5 × 5 grid points over [0, 1]^3, so the sampling frequencies are m = 8, 27, 64, 125. Tables 3 and 4 summarize the results of 100 simulation replicates. The proposed mfDNN classifier is superior to its counterparts in all 3D functional data cases, and the phase-transition pattern for mfDNN appears again. By contrast, the performance of MSDA and PLDA shows no improvement as m increases. When m = 125, the misclassification error rates of mfDNN are almost one third of MSDA's and one fourth of PLDA's in the Gaussian case, and almost half of either MSDA's or PLDA's error rates in Models 7 and 8. A plausible reason is that, under the functional data framework, our proposed mfDNN properly accommodates the repeatedly observed data over pixels or voxels, while the competitors treat this information as generic high-dimensional covariates and ignore the underlying smoothness structure. By efficiently extracting the projection scores of the continuum, the proposed mfDNN has full potential to discover the underlying distributions of the functional data classes. This again demonstrates that our proposed classifier has a distinct advantage over these competitors in complex imaging data classification problems.

Real data analysis

Handwritten digits
The first benchmark example was extracted from the MNIST database (http://yann.lecun.com/exdb/mnist/). This classical database contains 60,000 training images and 10,000 testing images of handwritten digits (0, 1, ..., 9); the black-and-white images were normalized to fit into a 28 × 28 pixel bounding box and anti-aliased. We used a tensor of Fourier bases for data processing. Based on our numerical experience, we chose candidates for (L, J, p, s) such that L = (2, 3, 4)^⊺, J = (300, 500, 800)^⊺, ∥p∥_∞ = (500, 1000, 2000)^⊺, and s = (0.01, 0.1, 0.5) for the dropout rate. Here we abuse the notation s, as dropout is the technique we use to sparsify the neural network. With the optimal parameters L_opt = 3, J_opt = 500, ∥p_opt∥_∞ = 1000, s_opt = 0.01 selected by validation, we report the misclassification risk in Table 5. We estimated the rules given by MSDA, PLDA and our proposal on the training set. As most observations for each subject are zeros, PenalizedLDA reports errors and fails to run. Our proposal achieves the highest accuracy with the sparsest classification rule, which again shows that our method is a competitive classifier with broad applicability.

ADNI database
The dataset used in the preparation of this article was obtained from the ADNI database (http://adni.loni.usc.edu). We randomly split the data with a 7 : 3 ratio in a balanced manner to form the training and testing sets, with 100 repetitions. We choose candidates for (L, J, p, s) such that L = (2, 3)^⊺, J = (50, 100, 200)^⊺, ∥p∥_∞ = (200, 500, 800)^⊺, and s = (0.01, 0.1, 0.5) for the dropout rate. We again compare our method with MSDA. In the 2D case, each subject has N = 79 × 95 = 7,505 observed pixels for each selected image slice. Table 6 displays the misclassification rates for the 2D brain imaging data of the AD, EMCI and CN groups. In the 3D case, the number of observed voxels for each patient's brain sample is N = 79 × 95 × 68 = 510,340. Unfortunately, given more than half a million covariates, the MSDA method breaks down, as the resulting gigantic covariance matrix (almost a million by a million) requires around 2TB of RAM to store, easily exceeding the memory limit of a common supercomputer. Hence, MSDA results for the 3D data are unavailable. Meanwhile, as PLDA recasts Fisher's discriminant problem as a biconvex problem that can be optimized by a simple iterative algorithm, it avoids the heavy computational burden of the covariance matrix and still works in this 3D case. Table 7 presents the empirical misclassification risks for mfDNN and PLDA.
There are several interesting findings in Tables 6 and 7. First, our proposed classifier performs better than its competitors on any 2D slice data and on the 3D data. Second, from Table 7 we conclude that, given a single slice of 2D imaging data, the misclassification rates are consistently larger than when using the 3D data, indicating that the 3D data contain more information for labelling the brain images among the three stages of the disease. Third, the 10-th and the 20-th slices yield the lowest misclassification rates among all 2D slices. Indeed, it is well known that Alzheimer's disease destroys neurons and their connections in the hippocampus, the entorhinal cortex, and the cerebral cortex, regions that correspond to the first 25 slices. This is a promising finding for neurologists, as the small risk indicates that these particular slices carry useful information for distinguishing the CN, EMCI and AD groups; further medical examination of this brain region may be worthwhile.

Summary
In this paper, we propose a new formulation for deriving multiclass classifiers for multidimensional functional data. We show that our proposal has a solid theoretical foundation and can be solved by a very efficient computational algorithm. Our proposal gives a unified treatment of both one-dimensional and multidimensional classification problems, and can be regarded as an efficient generalization of multiclass classification methods from i.i.d. data to multidimensional functional data. To the best of our knowledge, this is the first work on multiclass classification for multidimensional functional data with theoretical justification.

Remark 4 .
When functional curves are discretely observed, Theorem 3.2 provides the convergence rate of the truncated KL risk when the biases of the projection scores are uniformly bounded by m^{−τ}. This assumption does not hold universally, but it is satisfied in various examples. For any empirical process on [0, 1], if we use a Fourier basis with m terms to decompose the curve, we can show that E|ξ̂_j − ξ_j| ≤ O(max_{k=1,...,K} Σ_{j=m+1}^∞ (λ_kj + μ²_kj)), where μ_kj = E ξ_j in the k-th group. As {λ_kj}_{j=1}^∞ and {μ²_kj}_{j=1}^∞ are both convergent, when λ_kj and μ²_kj decrease no faster than some polynomial order, the assumption follows easily. Another well-known example is the FPCA provided by [6]: according to Theorem 1 in [6], for all j = 1, ..., J, the estimators of the projection scores satisfy a bound of this form.

FIG 2 .
FIG 2. Averaged images of the 20-th, the 40-th and the 60-th slices of the EMCI group (left column) and the AD group (right column).

TABLE 1
Averaged misclassification rates with standard errors in brackets for 2D simulations when m = 9 and m = 25 over 100 replicates.

TABLE 2
Averaged misclassification rates with standard errors in brackets for 2D simulations when m = 100 and m = 400 over 100 replicates.

TABLE 3
Averaged misclassification rates with standard errors in brackets for 3D simulations when m = 8 and m = 27 over 100 replicates.

TABLE 4
Averaged misclassification rates with standard errors in brackets for 3D simulations when m = 64 and m = 125 over 100 replicates.

TABLE 5
Classification accuracy for MNIST data.
The ADNI is a longitudinal multicenter study designed to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of Alzheimer's Disease (AD). From this database, we collected PET data from 79 patients in the AD group, 45 patients in the Early Mild Cognitive Impairment (EMCI) group, and 101 people in the Control (CN) group. This PET dataset has been spatially normalized and post-processed. The AD patients have three to six doctor visits, and we selected the PET scans obtained at the third visit. People in the EMCI group only have the second visit, and we selected the PET scans obtained at the second visit. For the AD group, patients' ages range from 59 to 88 with an average of 76.49, and there are 33 females and 46 males among the 79 subjects. For the EMCI group, ages range from 57 to 89 with an average of 72.33, and there are 26 females and 19 males among the 45 subjects. For the CN group, ages range from 62 to 87 with an average of 75.98, and there are 40 females and 61 males among the 101 subjects. All scans were reoriented into 79 × 95 × 68 voxels, so each patient has 68 sliced 2D images with 79 × 95 pixels.

TABLE 6
Averaged misclassification rates with standard errors in brackets for ADNI 2D brain images.

TABLE 7
Averaged misclassification rates with standard errors in brackets for ADNI 3D brain images.