Exploring the presence of complex dependence structures in epidemiological and genomic data through flexible clustering

By Sylvia Richardson

Appears in collection: Thematic month on statistics - Week 5: Bayesian statistics and algorithms

Faced with data containing a large number of inter-related explanatory variables, finding ways to investigate complex multi-factorial effects is an important statistical task. This is particularly relevant for epidemiological study designs where large numbers of covariates are typically collected in an attempt to capture complex interactions between host characteristics and risk factors. A related task, which is of great interest in stratified medicine, is to use multi-omics data to discover subgroups of patients with distinct molecular phenotypes and clinical outcomes, thus providing the potential to target treatments more precisely. Flexible clustering is a natural way to tackle such problems. It can be used in an unsupervised or a semi-supervised manner by adding a link between the clustering structure and outcomes and performing joint modelling. In this case, the clustering structure is used to help predict the outcome. This latter approach, known as profile regression, has been implemented recently using a Bayesian nonparametric Dirichlet process (DP) modelling framework, which specifies a joint clustering model for covariates and outcome, with an additional variable selection step to uncover the variables driving the clustering (Papathomas et al., 2012). In this talk, two related issues will be discussed. Firstly, we will focus on categorical covariates, a common situation in epidemiological studies, and examine the relation between: (i) dependence structures highlighted by Bayesian partitioning of the covariate space incorporating variable selection; and (ii) log-linear modelling with interaction terms, a traditional approach to modelling dependence. We will show how the clustering approach can be employed to assist log-linear model determination, a challenging task as the model space quickly becomes very large (Papathomas and Richardson, 2015).
Secondly, we will discuss clustering as a tool for integrating information from multiple datasets, with a view to discovering useful structure for prediction. In this context, several related issues arise. Each dataset may carry a different amount of information for the predictive task, so methods for learning how to reweight each data type for this task will be presented. In the context of multi-omics datasets, the efficiency of different methods for performing integrative clustering will also be discussed, contrasting joint modelling and stepwise approaches. This will be illustrated by analysis of cancer genomics datasets. Joint work with Michael Papathomas and Paul Kirk.
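
As a rough illustration of why a Dirichlet process prior gives "flexible" clustering, the sketch below simulates the Chinese restaurant process, the partition distribution that a DP induces on the observations. It is not taken from the talk and is not how PReMiuM or the cited papers implement profile regression; the function name `crp_partition` and the parameter names are hypothetical, chosen only for this example. The point is that the number of clusters is not fixed in advance: it grows with the sample size and with the concentration parameter `alpha`.

```python
import random
from collections import Counter


def crp_partition(n_items, alpha, seed=0):
    """Draw one partition of n_items from a Chinese restaurant process.

    Hypothetical helper for illustration only. alpha > 0 is the DP
    concentration parameter; larger values favour more clusters.
    """
    rng = random.Random(seed)
    labels = []  # cluster label of each item, in arrival order
    sizes = []   # sizes[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    for alpha in (0.5, 5.0):
        labels = crp_partition(200, alpha, seed=1)
        # Report the number of clusters and the three largest ones.
        print(alpha, len(set(labels)), Counter(labels).most_common(3))
```

In a profile regression setting, each cluster drawn this way would additionally carry its own covariate profile and outcome parameter, and the partition would be updated by posterior sampling rather than drawn once from the prior.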

Information about the video

Date of recording: 29/02/2016
Date of publication: 16/03/2016
Institution: CIRM
Licence: CC BY NC ND
Language: English
Audience: Researchers
Director(s): Guillaume Hennenfent
Format: MP4

Citation data

DOI: 10.24350/CIRM.V.18937503
Cite this video: Richardson, Sylvia (29/02/2016). Exploring the presence of complex dependence structures in epidemiological and genomic data through flexible clustering. CIRM. Audiovisual resource. DOI: 10.24350/CIRM.V.18937503
URL: https://dx.doi.org/10.24350/CIRM.V.18937503

Bibliography

Chung, Y., & Dunson, D.B. (2009). Nonparametric Bayes conditional distribution modelling with variable selection. Journal of the American Statistical Association, 104(488), 1646-1660 - http://dx.doi.org/10.1198/jasa.2009.tm08302
Kirk, P., Griffin, J.E., Savage, R., Ghahramani, Z., & Wild, D.L. (2012). Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 28(24), 3290-3297 - http://dx.doi.org/10.1093/bioinformatics/bts595
Liverani, S., Hastie, D.I., Papathomas, M., & Richardson, S. (2015). PReMiuM: An R package for profile regression mixture models using Dirichlet processes. Journal of Statistical Software, 64(7) - http://dx.doi.org/10.18637/jss.v064.i07
Molitor, J.T., Papathomas, M., Jerrett, M., & Richardson, S. (2010). Bayesian profile regression with an application to the National Survey of Children's Health. Biostatistics, 11(3), 484-498 - http://dx.doi.org/10.1093/biostatistics/kxq013
Papathomas, M., Molitor, J., Richardson, S., Riboli, E., & Vineis, P. (2011). Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non-smokers. Environmental Health Perspectives, 119, 84-91 - http://dx.doi.org/10.1289/ehp.1002118
Papathomas, M., Molitor, J., Hoggart, C., Hastie, D., & Richardson, S. (2012). Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology, 36(6), 663-674 - http://dx.doi.org/10.1002/gepi.21661
Papathomas, M., & Richardson, S. (2015). On the utility of the Dirichlet process for linear model determination: application to graphical log-linear model determination. To appear in Journal of Statistical Planning and Inference. arXiv:1401.7214 - http://arxiv.org/abs/1401.7214
Papathomas, M. & Richardson, S. (2016). Exploring dependence between categorical variables: benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms. Journal of Statistical Planning and Inference, 173, 47-63 - http://dx.doi.org/10.1016/j.jspi.2016.01.002
