Research

 
 

Major Research Areas 

Bayesian Methods, Decision Theory, Statistical Learning Theory, Biostatistics, High-dimensional Data Analysis. 

Specific Research Interests 

My research interests span many areas of statistics, from classical decision theory and model selection to Bayesian methodology and applications of statistical learning. They all address a question raised by the advent of high-throughput data gathering and computing technology: how can we statistically model high-dimensional, large-scale data and make inferences under uncertainty when classical methodologies are not feasible? I focus in particular on the following topics.
Bayesian Multiple Decision Functions. My main research focuses on the theory and computation of multiple decision problems, especially from a Bayesian perspective. An example of a multiple decision problem arises in a genetic study that uses a microarray data set to determine which among a multitude of genes are differentially expressed across treatment groups. In the context of multiple decision making, a Type I error occurs when a gene is falsely declared to be differentially expressed, whereas a Type II error occurs when a gene is falsely declared to be identically expressed. In a paper published in Annals of Statistics in 2011, we formulated the problem as one of multiple hypothesis testing and considered optimal procedures that enhance the power of the tests, in terms of the Type II error rate, while controlling the Type I error rate. In a paper published in Metrika in 2015, we extended this result to a flexible class of such procedures. We also considered a class of weighted loss functions that combine the Type I and Type II error rates. Within the Bayesian decision framework, we derived the Bayes multiple decision function (BMDF), which is optimal with respect to the combined loss functions, and developed an efficient algorithm to obtain the optimal Bayes action. An important contribution of this work is that the search complexity of the BMDF, although it increases with the dimensionality of the data set, does so at a computationally acceptable rate. In contrast to many works in the literature, which assume that data from different genes are stochastically independent, we allow a dependent data structure in which the associations are obtained through a class of frailty-induced Archimedean copulas. In particular, the method accommodates non-Gaussian dependent data structures, which are typical of failure-time data. This work was published in Electronic Journal of Statistics in 2013.
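As a simplified illustration of the weighted-loss idea described above, and not the BMDF algorithm itself, the following Python sketch assumes an additive loss in which each Type I error costs w_type1 and each Type II error costs w_type2. Under that assumption the Bayes action decouples gene by gene: a gene is declared differentially expressed exactly when its posterior probability of differential expression exceeds w_type1 / (w_type1 + w_type2). The function names, the weights, and the posterior probabilities are all hypothetical and chosen only for illustration; loss functions based on error rates generally do not decouple in this way, which is why a search algorithm is needed in the general case.

```python
import numpy as np


def bayes_action_additive(post_alt, w_type1=1.0, w_type2=1.0):
    """Return 0/1 decisions minimizing posterior expected additive loss.

    post_alt[i] is the (assumed known) posterior probability that gene i
    is differentially expressed. Declaring gene i has expected loss
    w_type1 * (1 - post_alt[i]); not declaring it has expected loss
    w_type2 * post_alt[i]. Declaring is better when
    post_alt[i] > w_type1 / (w_type1 + w_type2).
    """
    post_alt = np.asarray(post_alt, dtype=float)
    threshold = w_type1 / (w_type1 + w_type2)
    return (post_alt > threshold).astype(int)


def expected_loss(actions, post_alt, w_type1=1.0, w_type2=1.0):
    """Posterior expected additive loss of a vector of 0/1 decisions."""
    actions = np.asarray(actions, dtype=float)
    post_alt = np.asarray(post_alt, dtype=float)
    type1_part = w_type1 * actions * (1.0 - post_alt)   # false declarations
    type2_part = w_type2 * (1.0 - actions) * post_alt   # missed declarations
    return float(np.sum(type1_part + type2_part))


if __name__ == "__main__":
    # Hypothetical posterior probabilities for four genes.
    posterior = [0.95, 0.40, 0.70, 0.10]
    decisions = bayes_action_additive(posterior, w_type1=3.0, w_type2=1.0)
    print(decisions, expected_loss(decisions, posterior, 3.0, 1.0))
```

Raising w_type1 relative to w_type2 raises the threshold, so fewer genes are declared; this is the usual trade-off between the two kinds of error.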
In addition to multiple testing problems, the BMDF can also be applied to multiple classification problems. An example is predicting the disease statuses of a group of new patients based on their clinical and genetic data, as well as those of previously studied patients with known disease statuses. In the context of multiple classification, we again considered the class of weighted loss functions of both Type I and Type II errors, where a Type I error occurs when a patient is falsely classified as having the disease, and a Type II error occurs when a patient is falsely classified as healthy. Using the Bayesian decision framework and focusing on generalized linear models, we derived the Bayes multiple classification function (BMCF). Although it is still a multiple decision problem, multiple classification is more challenging than multiple hypothesis testing because of possibly high-dimensional predictors and uncertainty about the link function. We adopted Bayesian model selection and model averaging methods to select optimal predictors and link functions jointly with classification. This procedure proved flexible and effective in simulation studies as well as in real data applications. The manuscript is in revision for submission to Computational Statistics & Data Analysis. I am also working on a specific type of classification problem in which the predictors are all binary, such as the presence or absence of a symptom, and the interactions among these predictors are nonlinear and expressed through Boolean logic. For example, a person may be at higher risk if they have Symptom A or Symptom B. We proposed a two-step procedure that combines Bayesian model selection and frequent pattern mining to deal with the large number of possible Boolean expressions. I was invited to present this work at the 2016 ICSA Applied Statistics Symposium.
Model Selection and Estimation for Quantal-Response Data. In environmental risk assessment, we considered the estimation of minimum exposure levels, called benchmark doses (BMDs), which induce a pre-specified benchmark response in a dose-response experiment with quantal-response data. In such settings, representations of the risk are traditionally based on a specified parametric model. It is a well-known concern, however, that existing parametric estimation techniques are sensitive to the form used to model the dose response. We assessed the impact of model misspecification and uncertainty via simulation studies, and developed a frequentist model averaging approach for estimating BMDs on the basis of information-theoretic weights. These works were published in 2012 and 2013 in Environmetrics. In addition, we developed a new model selection and model averaging approach based on focused information theory. This approach has an advantage over the other methods because it focuses directly on estimating BMDs rather than first assessing model fit over all doses. This work has been accepted by Risk Analysis. I plan to work on a bootstrap approach to estimate the standard error of BMDs obtained after applying the new model selection approach under uncertainty. The application of this methodology can improve environmental health planning and risk regulation when dealing with low-level exposures to hazardous agents.
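To make the idea of model averaging with information-theoretic weights concrete, here is a minimal Python sketch. It assumes Akaike (AIC) weights, which are one common choice of information-theoretic weights; the text above does not specify which weights were used, so this is an illustrative stand-in rather than the published method. The function names, the candidate dose-response models, and all numerical values are hypothetical.

```python
import numpy as np


def akaike_weights(aic):
    """Akaike weights: w_i proportional to exp(-0.5 * (AIC_i - min AIC))."""
    aic = np.asarray(aic, dtype=float)
    delta = aic - aic.min()          # differences from the best (smallest) AIC
    raw = np.exp(-0.5 * delta)       # relative likelihood of each model
    return raw / raw.sum()           # normalize so the weights sum to one


def model_averaged_bmd(bmd_estimates, aic):
    """Weighted average of per-model BMD estimates using Akaike weights."""
    weights = akaike_weights(aic)
    return float(np.dot(weights, np.asarray(bmd_estimates, dtype=float)))


if __name__ == "__main__":
    # Hypothetical BMD estimates and AIC values from three fitted
    # dose-response models (for example logistic, probit, multistage).
    bmd_hat = [12.0, 15.0, 9.5]
    aic = [210.3, 211.1, 214.8]
    print(akaike_weights(aic))
    print(model_averaged_bmd(bmd_hat, aic))
```

Models with lower AIC receive more weight, so a poorly fitting model contributes little to the averaged BMD without being discarded outright.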
High-Dimensional Data Analysis Applications. Most of my collaborative research with non-statistical scientists and engineers involves high-dimensional data analysis. These applications not only let me contribute my specialty to specific projects but, more importantly, have motivated and helped me to develop new methodologies. For example, the projects with Dr. DeEtta Millers in the biology department and the project with medical researcher Dr. Roberto Vazquez-Padron from University of Miami provided me with real medical data along with challenging practical high-dimensional variable selection problems; the collaborations with the CATE lab in the engineering school, led by Dr. Malek Adjouadi, on brain imaging and the project with computer scientist Dr. Giri Narasimhan on bioinformatics inspired me to apply various data mining techniques in my current and future research in statistical learning. I am supported as a Co-I by an NIH-funded R01 project, and I have submitted thirteen grant proposals for external funding as Co-I or Co-PI; one R01 project was funded recently, and three proposals are under review. I look forward to continuing to work with these labs and groups as a team scientist solving data-related scientific problems.

The picture at the top shows the changes in expression of 22*5 genes, among thousands, from a microarray. Red dots indicate increased expression levels, and green dots indicate decreased expression levels.