Collection: Research talks

Random forests variable importances: towards a better understanding and large-scale feature selection

By Pierre Geurts

Also appears in the collection: Thematic month on statistics - Week 1: Statistical learning / Mois thématique sur les statistiques - Semaine 1 : apprentissage

Random forests are among the most popular supervised machine learning methods. One of their most practically useful features is the possibility of deriving, from the ensemble of trees, an importance score for each input variable that assesses its relevance for predicting the output. These importance scores have been successfully applied to many problems, notably in bioinformatics, but they are still not well understood from a theoretical point of view. In this talk, I will present our recent work towards a better understanding, and consequently a better exploitation, of these measures. In the first part of my talk, I will present a theoretical analysis of the mean decrease impurity importance under asymptotic conditions on ensemble and sample size. Our main results include an explicit formulation of this measure in the case of an ensemble of totally randomized trees and a discussion of the conditions under which this measure is consistent with respect to a common definition of variable relevance. The second part of the talk will be devoted to the analysis of finite tree ensembles in a constrained framework that assumes that each tree can be built only from a subset of variables of fixed size. This setting is motivated by very high-dimensional problems, or embedded systems, where one cannot assume that all variables fit into memory. We first consider a simple method that grows each tree on a subset of variables selected uniformly at random among all variables. We analyse the consistency and convergence rate of this method for the identification of all relevant variables under various problem and algorithm settings. From this analysis, we then motivate and design a modified variable sampling mechanism that is shown to significantly improve convergence under several conditions.
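The simple method mentioned in the abstract (each tree grown on a uniformly sampled subset of variables, with mean decrease impurity accumulated over the ensemble) can be sketched as follows. This is a minimal illustration and not the speaker's implementation: it assumes categorical inputs, Shannon entropy as the impurity measure, multiway splits, and no bootstrap sampling. The function names (`entropy`, `grow_tree`, `random_subspace_importances`) and the toy XOR data are hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def grow_tree(X, y, rows, candidates, n_total, importances, rng):
    """Grow one totally randomized tree on categorical inputs.

    At each node the split variable is drawn uniformly from `candidates`,
    and the node's weighted impurity decrease is added to its importance.
    """
    labels = [y[i] for i in rows]
    if not candidates or len(set(labels)) == 1:
        return  # leaf: no variable left or node is pure
    var = rng.choice(candidates)
    children = {}
    for i in rows:
        children.setdefault(X[i][var], []).append(i)
    child_entropy = sum(
        len(r) / len(rows) * entropy([y[i] for i in r])
        for r in children.values()
    )
    importances[var] += len(rows) / n_total * (entropy(labels) - child_entropy)
    remaining = [v for v in candidates if v != var]
    for r in children.values():
        grow_tree(X, y, r, remaining, n_total, importances, rng)


def random_subspace_importances(X, y, n_trees=500, subset_size=2, seed=0):
    """Average impurity-based importances over trees that each see only
    `subset_size` variables, sampled uniformly without replacement."""
    rng = random.Random(seed)
    p = len(X[0])
    totals = [0.0] * p
    rows = list(range(len(X)))
    for _ in range(n_trees):
        subset = rng.sample(range(p), subset_size)
        grow_tree(X, y, rows, subset, len(X), totals, rng)
    return [t / n_trees for t in totals]


# Toy data (assumption): full truth table of three binary inputs,
# with output y = X0 XOR X1 and X2 irrelevant.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]
imp = random_subspace_importances(X, y, n_trees=500, subset_size=2, seed=0)
print(imp)
```

In this toy setting the irrelevant variable X2 never reduces impurity, while X0 and X1 receive importance only from trees whose sampled subset contains both of them. This hints at why the variable sampling mechanism can affect how quickly relevant variables are identified.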

Video information

Citation data

  • DOI 10.24350/CIRM.V.18920603
  • Cite this video: Geurts, Pierre (02/02/2016). Random forests variable importances: towards a better understanding and large-scale feature selection. CIRM. Audiovisual resource. DOI: 10.24350/CIRM.V.18920603
  • URL https://dx.doi.org/10.24350/CIRM.V.18920603

Bibliography

  • [1] Châtel, C. (2015). Sélection de variables à grande échelle à partir de forêts aléatoires. Master thesis, Ecole Centrale de Marseille/Université de Liège
  • [2] Huynh-Thu, V. A., Saeys, Y., Wehenkel, L., & Geurts, P. (2012). Statistical interpretation of machine learning-based feature importance scores for biomarker discovery. Bioinformatics, 28(13), 1766-1774 - http://dx.doi.org/10.1093/bioinformatics/bts238
  • [3] Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9), e12776 - http://dx.doi.org/10.1371/journal.pone.0012776
  • [4] Louppe, G. (2014). Understanding random forests: From theory to practice. PhD thesis, University of Liège. arXiv:1407.7502 - http://arxiv.org/abs/1407.7502v3
  • [5] Louppe, G., Wehenkel, L., Sutera, A., & Geurts, P. (2013). Understanding variable importances in forests of randomized trees. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 431-439). Curran Associates - http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
  • [6] Louppe, G., & Geurts, P. (2012). Ensembles on random patches. In P.A. Flach, T. De Bie, & N. Cristianini (Eds.), Machine Learning and Knowledge Discovery in Databases (pp. 346-361). Berlin: Springer. (Lecture Notes in Computer Science, 7523) - http://dx.doi.org/10.1007/978-3-642-33460-3_28
  • [7] Marbach, D., Costello, J.C., Kuffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., & Stolovitzky, G. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 9(8), 796-804 - http://dx.doi.org/10.1038/nmeth.2016
