Statistical modeling: the two cultures (with comments and a rejoinder by the author)

Statistical Science, Vol. 16, No. 3. (August 2001), pp. 199-231


There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in ...


Iterative random forests to discover predictive and stable high-order interactions

Proceedings of the National Academy of Sciences, Vol. 115, No. 8. (20 February 2018), pp. 1943-1948


[Significance] We developed a predictive, stable, and interpretable tool: the iterative random forest algorithm (iRF). iRF discovers high-order interactions among biomolecules with the same order of computational cost as random forests. We demonstrate the efficacy of iRF by finding known and promising interactions among biomolecules, of up to fifth and sixth order, in two data examples in transcriptional regulation and alternative splicing. [Abstract] Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, ...


Classification and interaction in random forests

Proceedings of the National Academy of Sciences, Vol. 115, No. 8. (20 February 2018), pp. 1690-1692


Suppose you are a physician with a patient whose complaint could arise from multiple diseases. To attain a specific diagnosis, you might ask yourself a series of yes/no questions depending on observed features describing the patient, such as clinical test results and reported symptoms. As some questions rule out certain diagnoses early on, each answer determines which question you ask next. With about a dozen features and extensive medical knowledge, you could create a simple flow chart to connect and order ...


European Forest Types: toward an automated classification

Annals of Forest Science, Vol. 75, No. 1. (2018), pp. 1-14


[Key message] The outcome of the present study leads to the application of a spatially explicit rule-based expert system (RBES) algorithm aimed at automatically classifying forest areas according to the European Forest Types (EFT) system of nomenclature at pan-European scale level. With the RBES, the EFT system of nomenclature can be now easily implemented for objective, replicable, and automatic classification of field plots for forest inventories or spatial units (pixels or polygons) for thematic mapping. [Context] Forest Types classification systems are aimed at stratifying ...


Random forests

Machine Learning, Vol. 45, No. 1. (2001), pp. 5-32


Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random ...


