A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology


The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational biology and allied sciences. While the dimensionality of such datasets continues to grow, so too does the complexity of biomarker identification from exposure patterns in health studies measuring baseline confounders; moreover, doing so while avoiding model misspecification remains an issue only partially addressed. Efficient estimators capable of incorporating flexible, data adaptive regression techniques in estimating relevant components of the data-generating distribution provide an avenue for avoiding model misspecification; however, in the context of high-dimensional problems that require the simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type-I error control even under standard multiple testing corrections. We present a general approach for applying empirical Bayes shrinkage to variance estimators of a family of efficient, asymptotically linear estimators of population intervention causal effects. Our generalization of shrinkage-based variance estimators increases inferential stability in high-dimensional settings, facilitating the application of these estimators for deriving nonparametric variable importance measures in high-dimensional biological datasets with modest sample sizes. The result is a data adaptive approach for robustly uncovering stable causal associations in high-dimensional data in studies with limited samples. Our generalized variance estimator is evaluated against alternative variance estimators in numerical experiments. Identification of biomarkers with the proposed methodology is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.