With applications to cluster analysis in mind, we suggest new approaches to constructing tree diagrams that describe associations among points in a scatterplot. Our most basic tree diagram results in two data points being associated with one another if and only if their respective curves of steepest ascent up the density or intensity surface lead toward the same mode. The representation, in the sample space, of the set of steepest ascent curves corresponding to the data, is called the gradient tree. It has a regular, octopuslike structure, and is consistently estimated by its analogue computed from a nonparametric estimator which gives consistent estimation of both the density surface and its derivatives. We also suggest ‘forests’, in which data are linked by line segments which represent good approximations to portions of the population gradient tree. A forest is closely related to a minimum spanning tree, or MST, defined as the graph of minimum total length connecting all sample points. However, forests use a larger bandwidth for constructing the density-surface estimate than is implicit in the MST, with the result that they are substantially more orderly and are more readily interpreted. The effective bandwidth for the MST is so small that even the corresponding density-surface estimate, let alone its derivatives, is inconsistent. As a result, relationships that are suggested by the MST can change considerably if relatively small quantities of data are added or removed. Our trees and forests do not suffer from this problem. They are related to the concept of gradient traces, introduced by Wegman, Carr and Luo (1993) and Wegman and Carr (1993) for purposes quite different from our own.
Digital Object Identifier: 10.1214/lnms/1196285394