Dissertation Defense by Siyu Zhou
Thursday, June 16th, 2022, at 10:30 a.m.
Wesley W. Posvar Hall, Department of Statistics, Seminar Room
Title: Random Forests and Regularization
Abstract: Random forests have a long-standing reputation as excellent off-the-shelf statistical learning methods. Despite their empirical success and numerous studies of their statistical properties, a full and satisfying explanation for their success has yet to be put forth. This work takes a step in this direction by demonstrating that random feature subsetting provides an implicit form of regularization, making random forests more advantageous in low signal-to-noise ratio (SNR) settings. Moreover, this is not a tree-specific finding; it extends to ensembles of base learners constructed in a greedy fashion. Inspired by this, we find that the inclusion of additional noise features can serve as another implicit form of regularization and thereby lead to substantially more accurate models. As a result, intuitive notions of variable importance based on improved model accuracy may be deeply flawed, as even purely random noise can routinely register as statistically significant. Along these lines, we further investigate the effect of pruning trees in random forests. Although full-depth trees are recommended in many textbooks, we show that tree depth should be seen as a natural form of regularization across the entire procedure, with shallow trees preferred in low SNR settings.
Committee Chair and Advisor: Dr. Lucas Mentch
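The abstract's central claim, that the feature-subsetting rate (mtry, called max_features in scikit-learn) behaves like a regularization parameter whose benefit depends on the SNR, can be sketched on synthetic low-SNR data. This is an illustrative assumption-laden sketch, not code from the dissertation; the data-generating process, sample sizes, and parameter choices are all made up for the example.

```python
# Sketch: compare bagging (no feature subsetting) against a random forest
# with mtry = sqrt(p) on simulated low-SNR regression data, treating the
# subsetting rate as a tuning/regularization parameter.
# All settings below are illustrative, not taken from the dissertation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Low-SNR linear data: a few weak signal features buried in heavy noise.
n, p = 300, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 0.5                                 # only 3 weak signal features
y = X @ beta + 3.0 * rng.standard_normal(n)    # large noise => low SNR
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

# Bagging: each split may consider all p features (max_features=1.0).
full = RandomForestRegressor(
    n_estimators=200, max_features=1.0, random_state=0
).fit(X_tr, y_tr)

# Random forest: each split considers only sqrt(p) candidate features.
subset = RandomForestRegressor(
    n_estimators=200, max_features="sqrt", random_state=0
).fit(X_tr, y_tr)

mse_full = mean_squared_error(y_te, full.predict(X_te))
mse_subset = mean_squared_error(y_te, subset.predict(X_te))
print(f"bagging MSE: {mse_full:.3f}  mtry=sqrt(p) MSE: {mse_subset:.3f}")
```

Per the abstract, one would expect the subsetted forest to tend to do no worse (often better) in low-SNR regimes like this one, with the advantage shrinking or reversing as the SNR grows; a single simulated draw, of course, proves nothing on its own.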