Statistical Science

Sequential Approach for Identifying Lead Compounds in Large Chemical Databases

Markus Abt, YongBin Lim, Jerome Sacks, Minge Xie, and S. Stanley Young

Full-text: Open access


At the early stage of drug discovery, many thousands of chemical compounds can be synthesized and tested (assayed) for potency (activity) with high throughput screening (HTS). With ever-increasing numbers of compounds to be tested (now often in the neighborhood of 500,000), it remains a challenge to find sequential design strategies that reduce costs while locating classes of active compounds. An initial screen of a modest number of selected compounds (the first stage) is used to construct a structure-activity relationship (SAR). Based on this model, a second-stage sample is selected and the SAR is updated; if no more sampling is done, the activities of the compounds not yet tested are predicted. Alternatively, instead of stopping, the updated SAR can be used to determine another stage of sampling, after which the SAR is updated again and the process is repeated.
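The staged procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the compound library is synthetic, the activity rule is invented, and the "SAR" is a toy additive score built from per-descriptor hit rates rather than the recursive partitioning listed among the article's keywords. All function names (`make_library`, `fit_sar`, `score`, `sequential_screen`, `hit_rate`) are hypothetical.

```python
import random


def make_library(n=2000, n_features=6, seed=0):
    """Synthetic stand-in for a compound library (assumption, not real data).

    Each compound is (descriptors, activity) with binary descriptors and a
    0/1 activity. The invented rule: compounds having descriptors 0 and 1
    both present are active far more often than the rest.
    """
    rng = random.Random(seed)
    library = []
    for _ in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(n_features))
        p_active = 0.6 if (x[0] == 1 and x[1] == 1) else 0.01
        y = 1 if rng.random() < p_active else 0
        library.append((x, y))
    return library


def fit_sar(tested):
    """Toy SAR: smoothed hit rate for each (descriptor, value) pair."""
    n_features = len(tested[0][0])
    rates = []
    for j in range(n_features):
        per_value = []
        for v in (0, 1):
            ys = [y for x, y in tested if x[j] == v]
            per_value.append((sum(ys) + 0.5) / (len(ys) + 1.0))
        rates.append(per_value)
    return rates


def score(sar, x):
    """Predicted activity score: sum of the per-descriptor hit rates."""
    return sum(sar[j][x[j]] for j in range(len(x)))


def sequential_screen(library, first_n=200, stage_n=100, n_stages=2, seed=1):
    """Random first stage, then model-guided stages with SAR updates."""
    rng = random.Random(seed)
    untested = list(range(len(library)))
    rng.shuffle(untested)
    stages = [untested[:first_n]]          # first stage: random selection
    untested = untested[first_n:]
    for _ in range(n_stages):
        tested = [library[i] for stage in stages for i in stage]
        sar = fit_sar(tested)              # update the SAR on all data so far
        untested.sort(key=lambda i: score(sar, library[i][0]), reverse=True)
        stages.append(untested[:stage_n])  # "assay" the top-ranked compounds
        untested = untested[stage_n:]
    tested = [library[i] for stage in stages for i in stage]
    return stages, fit_sar(tested), untested


def hit_rate(library, indices):
    """Fraction of active compounds among the given library indices."""
    return sum(library[i][1] for i in indices) / len(indices)
```

Calling `sequential_screen(make_library())` returns the indices tested at each stage, the final toy SAR, and the indices left untested, whose scores under the final SAR serve as the activity predictions. Comparing `hit_rate` for a model-guided stage against the whole library shows how much the guided selection concentrates active compounds in this synthetic setting.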

We use existing data on the potency and chemical structure of 70,223 compounds to investigate various sequential testing schemes. Evidence on two assays supports the conclusion that a rather small number of samples selected according to the proposed scheme can more than triple the rate at which active compounds are identified and also produce SARs effective for identifying chemical structure. A different set of 52,883 compounds is used to confirm our findings.
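One way to read "more than triple the rate" is as a ratio of hit rates: the fraction of active compounds among those selected by the scheme, divided by the fraction of active compounds in the whole library. The sketch below assumes that reading; the function name `enrichment` is hypothetical, and the hit counts in the usage line are made up for illustration. Only the library size of 70,223 comes from the abstract.

```python
def enrichment(hits_selected, n_selected, hits_library, n_library):
    """Ratio of the hit rate in the selected sample to the library hit rate."""
    selected_rate = hits_selected / n_selected
    library_rate = hits_library / n_library
    return selected_rate / library_rate


# Hypothetical counts (NOT from the paper): 700 actives in a 70,223-compound
# library, and 66 actives among 2,000 compounds chosen by a sequential scheme.
ratio = enrichment(66, 2000, 700, 70223)
print(f"enrichment factor: {ratio:.2f}")
```

A ratio above 3 would correspond to the tripling described in the abstract, and a ratio of 1 would mean the scheme does no better than testing compounds at random.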

One surprising conclusion of the study is that the design of the initial sample stage may be unimportant: random selection or systematic methods based on chemical structures are equally effective.

Article information

Statist. Sci., Volume 16, Issue 2 (2001), 154-168.

First available in Project Euclid: 24 December 2001

Keywords: combinatorial chemistry, data mining, high throughput screening, recursive partitioning, sequential design, structure-activity relationship


Abt, Markus; Lim, YongBin; Sacks, Jerome; Xie, Minge; Young, S. Stanley. Sequential Approach for Identifying Lead Compounds in Large Chemical Databases. Statist. Sci. 16 (2001), no. 2, 154--168. doi:10.1214/ss/1009213288.

