Random forests variable importances: towards a better understanding and large-scale feature selection

By Pierre Geurts

Also appears in collection: Thematic month on statistics - Week 1: Statistical learning / Mois thématique sur les statistiques - Semaine 1 : apprentissage

Random forests are among the most popular supervised machine learning methods. One of their most practically useful features is the possibility of deriving, from the ensemble of trees, an importance score for each input variable that assesses its relevance for predicting the output. These importance scores have been successfully applied to many problems, notably in bioinformatics, but they are still not well understood from a theoretical point of view. In this talk, I will present our recent work towards a better understanding, and consequently a better exploitation, of these measures. In the first part of my talk, I will present a theoretical analysis of the mean decrease impurity importance under asymptotic ensemble size and sample size conditions. Our main results include an explicit formulation of this measure in the case of ensembles of totally randomized trees and a discussion of the conditions under which this measure is consistent with respect to a common definition of variable relevance. The second part of the talk will be devoted to the analysis of finite tree ensembles in a constrained framework that assumes that each tree can be built only from a subset of variables of fixed size. This setting is motivated by very high-dimensional problems, or embedded systems, where one cannot assume that all variables fit into memory. We first consider a simple method that grows each tree on a subset of variables selected uniformly at random among all variables. We analyse the consistency and convergence rate of this method for the identification of all relevant variables under various problem and algorithm settings. From this analysis, we then motivate and design a modified variable sampling mechanism that is shown to significantly improve convergence in several conditions.
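As a rough illustration of the constrained setting described in the abstract, the sketch below grows totally randomized trees, each restricted to a random subset of q variables, and accumulates a mean decrease impurity score per variable. It is a toy under stated assumptions and not the speaker's implementation: inputs are categorical, splits are multiway, impurity is Shannon entropy, and the names `entropy`, `grow`, `random_subspace_importances`, `q` and `n_trees` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def grow(indices, variables, X, y, rng, imp, n_total):
    """Grow one totally randomized tree; add its impurity decreases to imp."""
    labels = [y[i] for i in indices]
    if not variables or len(set(labels)) <= 1:
        return  # leaf: no variable left to split on, or node already pure
    v = rng.choice(variables)  # split variable drawn uniformly at random
    groups = {}
    for i in indices:
        groups.setdefault(X[i][v], []).append(i)  # one child per observed value
    h_children = sum(len(g) / len(indices) * entropy([y[i] for i in g])
                     for g in groups.values())
    # Impurity decrease, weighted by the fraction of samples reaching the node.
    imp[v] += len(indices) / n_total * (entropy(labels) - h_children)
    remaining = [u for u in variables if u != v]
    for g in groups.values():
        grow(g, remaining, X, y, rng, imp, n_total)


def random_subspace_importances(X, y, n_trees, q, seed=0):
    """Average importances over trees that each see only q random variables."""
    rng = random.Random(seed)
    p = len(X[0])
    imp = [0.0] * p
    for _ in range(n_trees):
        subset = rng.sample(range(p), q)  # uniform subset of fixed size q
        grow(list(range(len(X))), subset, X, y, rng, imp, len(X))
    return [s / n_trees for s in imp]


# Toy data (an assumption for illustration): the full truth table of four
# binary inputs, with y = x0 XOR x1, so x2 and x3 are irrelevant.
X = [[(i >> b) & 1 for b in range(4)] for i in range(16)]
y = [row[0] ^ row[1] for row in X]

print(random_subspace_importances(X, y, n_trees=200, q=2))
print(random_subspace_importances(X, y, n_trees=200, q=4))
```

In this toy problem, x0 and x1 are only informative jointly, so with q=2 they receive credit only in trees whose subset happens to contain both of them, while x2 and x3 never receive any. This gives a small-scale intuition for why the variable sampling mechanism can affect how quickly relevant variables are identified.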

Information about the video

Citation data

  • DOI: 10.24350/CIRM.V.18920603
  • Cite this video: Geurts, Pierre (02/02/2016). Random forests variable importances: towards a better understanding and large-scale feature selection. CIRM. Audiovisual resource. DOI: 10.24350/CIRM.V.18920603
  • URL: https://dx.doi.org/10.24350/CIRM.V.18920603

Bibliography

  • [1] Châtel, C. (2015). Sélection de variables à grande échelle à partir de forêts aléatoires. Master's thesis, École Centrale de Marseille/Université de Liège
  • [2] Huynh-Thu, V. A., Saeys, Y., Wehenkel, L., & Geurts, P. (2012). Statistical interpretation of machine learning-based feature importance scores for biomarker discovery, Bioinformatics 28(13), 1766-1774 - http://dx.doi.org/10.1093/bioinformatics/bts238
  • [3] Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods, PLoS ONE 5(9), e12776 - http://dx.doi.org/10.1371/journal.pone.0012776
  • [4] Louppe, G. (2014). Understanding random forests: From theory to practice. PhD thesis, University of Liège. <arXiv:1407.7502> - http://arxiv.org/abs/1407.7502v3
  • [5] Louppe, G., Wehenkel, L., Sutera, A., & Geurts, P. (2013). Understanding variable importances in forests of randomized trees. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 431-439). Curran Associates - http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
  • [6] Louppe, G., & Geurts, P. (2012). Ensembles on random patches. In P.A. Flach, T. De Bie, & N. Cristianini (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 346-361). Berlin: Springer. (Lecture Notes in Computer Science, 7523) - http://dx.doi.org/10.1007/978-3-642-33460-3_28
  • [7] Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., & Stolovitzky, G. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8), 796-804 - http://dx.doi.org/10.1038/nmeth.2016
