Exploring the presence of complex dependence structures in epidemiological and genomic data through flexible clustering

By Sylvia Richardson

Appears in collection: Thematic month on statistics - Week 5: Bayesian statistics and algorithms

Faced with data containing a large number of inter-related explanatory variables, finding ways to investigate complex multi-factorial effects is an important statistical task. This is particularly relevant for epidemiological study designs, where large numbers of covariates are typically collected in an attempt to capture complex interactions between host characteristics and risk factors. A related task, which is of great interest in stratified medicine, is to use multi-omics data to discover subgroups of patients with distinct molecular phenotypes and clinical outcomes, thus providing the potential to target treatments more precisely.

Flexible clustering is a natural way to tackle such problems. It can be used in an unsupervised manner or, by adding a link between the clustering structure and outcomes and performing joint modelling, in a semi-supervised manner. In the semi-supervised case, the clustering structure is used to help predict the outcome. This latter approach, known as profile regression, has been implemented recently using a Bayesian nonparametric Dirichlet process (DP) modelling framework, which specifies a joint clustering model for covariates and outcome, with an additional variable selection step to uncover the variables driving the clustering (Papathomas et al., 2012).

In this talk, two related issues will be discussed. Firstly, we will focus on categorical covariates, a common situation in epidemiological studies, and examine the relation between: (i) dependence structures highlighted by Bayesian partitioning of the covariate space incorporating variable selection; and (ii) log-linear modelling with interaction terms, a traditional approach to modelling dependence. We will show how the clustering approach can be employed to assist log-linear model determination, a challenging task as the model space quickly becomes very large (Papathomas and Richardson, 2015).

Secondly, we will discuss clustering as a tool for integrating information from multiple datasets, with a view to discovering useful structure for prediction. In this context several related issues arise. Each dataset may carry a different amount of information for the predictive task, so methods for learning how to reweight each data type for this task will be presented. In the context of multi-omics datasets, the efficiency of different methods for performing integrative clustering will also be discussed, contrasting joint modelling and stepwise approaches. This will be illustrated by the analysis of cancer genomics datasets. Joint work with Michael Papathomas and Paul Kirk.
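
As an informal illustration of the joint clustering model for covariates and outcome described in the abstract, the following Python sketch simulates data from a simplified profile-regression-style model: cluster weights come from a truncated stick-breaking construction of a Dirichlet process, each cluster has its own category probabilities for every categorical covariate, and each cluster has its own log-odds for a binary outcome. This is not the speakers' implementation and not the PReMiuM API. The function names, the truncation level, the Dirichlet(1, ..., 1) and normal priors, and the omission of fixed effects and variable selection are all assumptions made for brevity.

```python
import math
import random


def stick_breaking(alpha, n_clusters, rng):
    """Truncated stick-breaking weights for a DP with concentration alpha."""
    weights = []
    remaining = 1.0
    for _ in range(n_clusters - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last cluster takes the rest, so weights sum to 1
    return weights


def simulate_profile_regression(n, n_covariates, n_levels,
                                alpha=1.0, n_clusters=20, seed=0):
    """Simulate (weights, cluster labels, covariates, outcomes).

    Hypothetical helper: the priors and truncation level are illustrative only.
    """
    rng = random.Random(seed)
    psi = stick_breaking(alpha, n_clusters, rng)

    # phi[c][j]: category probabilities of covariate j in cluster c,
    # drawn from Dirichlet(1, ..., 1) via normalised Gamma(1, 1) variates.
    phi = []
    for _ in range(n_clusters):
        per_covariate = []
        for _ in range(n_covariates):
            g = [rng.gammavariate(1.0, 1.0) for _ in range(n_levels)]
            total = sum(g)
            per_covariate.append([value / total for value in g])
        phi.append(per_covariate)

    # theta[c]: cluster-specific log-odds of the binary outcome (assumed prior).
    theta = [rng.gauss(0.0, 2.0) for _ in range(n_clusters)]

    levels = list(range(n_levels))
    z, x, y = [], [], []
    for _ in range(n):
        c = rng.choices(range(n_clusters), weights=psi)[0]
        profile = [rng.choices(levels, weights=phi[c][j])[0]
                   for j in range(n_covariates)]
        p = 1.0 / (1.0 + math.exp(-theta[c]))
        z.append(c)
        x.append(profile)
        y.append(1 if rng.random() < p else 0)
    return psi, z, x, y
```

For example, `simulate_profile_regression(n=200, n_covariates=5, n_levels=3)` returns the cluster weights, the latent cluster labels, the categorical covariate profiles and the binary outcomes. In an actual analysis the labels are unobserved and would be inferred jointly with the cluster parameters, for instance by MCMC.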

Information about the video

Citation data

  • DOI: 10.24350/CIRM.V.18937503
  • Cite this video: Richardson, Sylvia (29/02/2016). Exploring the presence of complex dependence structures in epidemiological and genomic data through flexible clustering. CIRM. Audiovisual resource. DOI: 10.24350/CIRM.V.18937503
  • URL: https://dx.doi.org/10.24350/CIRM.V.18937503

Bibliography

  • Chung, Y., & Dunson, D.B. (2009). Nonparametric Bayes conditional distribution modelling with variable selection. Journal of the American Statistical Association, 104(488), 1646-1660 - http://dx.doi.org/10.1198/jasa.2009.tm08302
  • Kirk, P., Griffin, J.E., Savage, R., Ghahramani, Z., & Wild, D.L. (2012). Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 28(24), 3290-3297 - http://dx.doi.org/10.1093/bioinformatics/bts595
  • Liverani, S., Hastie, D.I., Papathomas, M., & Richardson, S. (2015). PReMiuM: An R package for profile regression mixture models using Dirichlet processes. Journal of Statistical Software, 64(7) - http://dx.doi.org/10.18637/jss.v064.i07
  • Molitor, J.T., Papathomas, M., Jerrett, M., & Richardson, S. (2010). Bayesian profile regression with an application to the National Survey of Children's Health. Biostatistics, 11(3), 484-498 - http://dx.doi.org/10.1093/biostatistics/kxq013
  • Papathomas, M., Molitor, J., Richardson, S., Riboli, E., & Vineis, P. (2011). Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in nonsmokers. Environmental Health Perspectives, 119, 84-91 - http://dx.doi.org/10.1289/ehp.1002118
  • Papathomas, M., Molitor, J., Hoggart, C., Hastie, D., & Richardson, S. (2012). Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology, 36(6), 663-674 - http://dx.doi.org/10.1002/gepi.21661
  • Papathomas, M. & Richardson, S. (2015). On the utility of the Dirichlet process for linear model determination: application to graphical log-linear model determination. To appear in Journal of Statistical Planning and Inference. <arXiv:1401.7214> - http://arxiv.org/abs/1401.7214
  • Papathomas, M. & Richardson, S. (2016). Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. Journal of Statistical Planning and Inference, 173, 47-63 - http://dx.doi.org/10.1016/j.jspi.2016.01.002
