Life Sciences Building, Room 206
501 S. Nedderman Drive
Box 19047
Arlington, TX 76019
DIVISION OF DATA SCIENCE EVENTS
Unless specified below, all seminars will start at 3:30 pm Fridays in Pickard Hall, Room 110:
FALL 2024
Title: "Generalised Long Memory Time Series and Applications: An Overview"
When: November 15th, 2024
Shelton Peiris
Visiting Professor at the Department of Statistics, UC Davis
Abstract: "Analysis of long memory time series has become very popular among theoretical and applied researchers in the last 2-3 decades due to its flexibility in many applications in almost every field. In this paper, particular attention is paid to the development of Generalised Long Memory time series generated by Gegenbauer polynomials and Autoregressive Moving Average (ARMA) models. Several estimation methods will be discussed and recent applications in various fields will be presented. A multivariate or vector extension to the GARMA family (i.e., Vector GARMA or VEGARMA) will be introduced along with the relevant theoretical properties and applications."
Title: "Statistical Modeling of Topological Features in Medical Imaging: Enhancing Prognostic Precision and Interpretation"
When: November 8th, 2024
Chul Moon
Associate Professor in the Department of Statistics and Data Science at SMU
Abstract: "Tumor shape significantly influences growth and metastasis. We introduce a topological feature obtained by persistent homology to characterize tumor progression in pathology and radiology images, focusing on its influence on time-to-event data. These topological features, invariant to scale-preserving transformations, capture diverse tumor shape patterns. We introduce a functional spatial Cox proportional hazards model that represents these topological features in a functional space, utilizing them as functional predictors alongside their spatial locations. This model allows for interpretable analysis of the relationship between topological shape features and survival risks."
Title: "An epigenetic view of aging dynamics"
When: November 1st, 2024
Feng Gao
Assistant Professor in the Department of Environmental Health Sciences and Department of Molecular and Medical Pharmacology at UCLA
Abstract: "Aging is one of the most important risk factors for many diseases, and is a complex process that involves multiple factors from genetics to environmental and lifestyle factors. Meanwhile, aging-related epigenetic changes such as DNA methylation can participate in the regulation of the aging process. Therefore, understanding the epigenetic mechanisms, including the dynamics in aging, will provide new insights into developing new approaches for disease prevention. Indeed, the rapid development of DNA methylation analysis has provided rich information about epigenetic regulations. For example, DNA methylation analysis can measure more than 450K and 850K CpG sites through microarray technology. These data provide great opportunities to study aging but also pose great challenges in learning useful information from high dimensional data. In this talk, I’ll share our recent research on aging. Specifically, I’ll talk about how we leverage novel computational models to reveal biological patterns and decipher the complex information embedded in high dimensional epigenetics data to study aging. I’ll discuss our findings about the dynamics of the aging process. Finally, I’ll talk about our future directions in leveraging multi-omics data for aging studies."
Title: "Some Issues and Challenges in the analysis of Biomedical Data"
When: October 25th, 2024
Zhezhen Jin
Professor of Biostatistics in the Department of Biostatistics in Mailman School of Public Health at Columbia University
Abstract: "It is essential to incorporate basic statistical principles and ideas in data analysis. In the analysis of biomedical data, one often needs to compare and identify biomarkers that are more informative for disease diagnosis and monitoring, and to evaluate the effects of various treatment procedures and plans on health outcomes. After a discussion on the issues and challenges with some real examples, I will review available statistical methods and present our newly developed semiparametric statistical methods that are useful for item reduction, differentiation of significant exposure factors and high dimensional data analysis."
Title: "3D reconstruction of spatial transcriptomics with spatial pattern enhanced graph convolutional neural network"
When: October 4th, 2024
Lin Xu
Assistant Professor in the Department of Health Data Sciences and Biostatistics at SPH, the Department of Pediatrics at Medical School, and a member of the Quantitative Biomedical Research Center (QBRC), and the Harold C. Simmons Cancer Center at UT Southwestern Medical Center
Abstract: "Existing statistical and deep learning algorithms used for analyzing spatially resolved transcriptomics (SRT) data rely solely on two-dimensional (2D) spatial coordinates, which limits their ability to accurately identify three-dimensional (3D) spatial patterns. To address this limitation, we introduced Spa3D, which utilizes an anti-leakage Fourier transform and a graph convolutional neural network model to reconstruct 3D-based spatial structure. We demonstrate that Spa3D is applicable to data from various SRT technology platforms and outperforms state-of-the-art methods on elucidating 3D-based spatial domains, cell-cell communication, organ-level tempo-spatial development patterns, and 3D spatial trajectories that are not captured by 2D spatial coordinates."
Title: "A Meta-analysis based Hierarchical Variance Model for Powering One and Two-sample t-tests"
When: September 20th, 2024
Jackson Barth
Assistant Professor in the Department of Statistical Science at Baylor University
Abstract: "Sample size determination (SSD) is essential in statistical inference and hypothesis testing, as it directly affects the accuracy and power of the analysis. We propose a SSD methodology for one and two-sample t-tests that ensures clinical relevance using a pre-determined unstandardized effect size. Our novel approach leverages Bayesian meta-analysis to account for the uncertainty surrounding the variance, a common issue in SSD. By incorporating prior knowledge from related studies via a Bayesian gamma-inverse gamma model, we obtain an informative posterior predictive distribution for the variance that leads to better decisions about sample size. For efficient posterior sampling, we propose an empirical Bayes approach, which is further combined with a quantile simulation approach to facilitate computation. Simulations and empirical studies demonstrate that our methodology outperforms other aggregate approaches (simple average, weighted average, median) in variance estimation for SSD, especially in meta-analyses with large disparity in sample size and moderate variance. Thus, it offers a robust and practical solution for sample size determination in t-tests."
Title: "Multimodal Large Language Models for Biomedical Applications"
When: September 13th, 2024
Junzhou Huang
Jenkins Garrett Professor in the Computer Science and Engineering department at the University of Texas at Arlington
Abstract: "Biomedical research is increasingly characterized by the availability of vast and diverse data types, ranging from imaging data, genetic sequences, and molecular profiles to clinical texts and patient records. This rich array of biomedical data presents significant opportunities for advancing our understanding of complex biological systems and improving healthcare outcomes. However, the challenge lies in effectively integrating and analyzing these multimodal datasets to extract meaningful insights. Large Language Models (LLMs) have recently emerged as powerful tools capable of processing and understanding diverse data modalities, enabling more comprehensive and accurate insights into biomedical applications. This talk will introduce several recent works that leverage multimodal LLMs to address key challenges across different biomedical domains. Specifically, we will explore the development and application of multimodal LLMs for computational pathology, gene ontology, and computational immunology. These approaches aim to bridge the gap between different data types, enabling more comprehensive and insightful interpretations that can drive new discoveries in biomedical science."
SPRING 2024
Title: "Statistics and the Knowledge Economy"
When: April 19th, 2024
Where: Frances Anne Moody Hall, Southern Methodist University
David Banks
Professor of the Practice of Statistics, Duke University
Abstract: "Statistics came of age when manufacturing was king. But today’s industries are focused on information technology. Remarkably, a lot of our expertise transfers directly. This talk will discuss statistics and AI in the context of computational advertising, autonomous vehicles, large language models, and process optimization."
Title: "Frequency Band Analysis of Nonstationary Multivariate Time Series"
When: April 12th, 2024
Raanju Sundararajan
Assistant Professor in the Department of Statistics and Data Science at Southern Methodist University
Abstract: "Information from frequency bands in biomedical time series provides useful summaries of the observed signal. Many existing methods consider summaries of the time series obtained over a few well-known, pre-defined frequency bands of interest. However, these methods do not provide data-driven methods for identifying frequency bands that optimally summarize frequency-domain information in the time series. A new method to identify partition points in the frequency space of a multivariate locally stationary time series is proposed. These partition points signify changes across frequencies in the time-varying behavior of the signal and provide frequency band summary measures that best preserve the nonstationary dynamics of the observed series. An L_2 norm-based discrepancy measure that finds differences in the time-varying spectral density matrix is constructed, and its asymptotic properties are derived. New nonparametric bootstrap tests are also provided to identify significant frequency partition points and to identify components and cross-components of the spectral matrix exhibiting changes over frequencies. Finite-sample performance of the proposed method is illustrated via simulations. The proposed method is used to develop optimal frequency band summary measures for characterizing time-varying behavior in resting-state electroencephalography (EEG) time series, as well as identifying components and cross-components associated with each frequency partition point."
Title: "Interplay of Linear Algebra, Machine Learning, and High Performance Computing"
When: April 5th, 2024
Xiaoye Sherry Li
Senior Scientist in the Computational Research Division, Lawrence Berkeley National Laboratory
Abstract: "In recent years, we have seen a large body of research using hierarchical matrix algebra to construct low complexity linear solvers and preconditioners. Not only can these fast solvers significantly accelerate the speed of large scale PDE based simulations, but also they can speed up many AI and machine learning algorithms which are often matrix-computation-bound. On the other hand, statistical and machine learning methods can be used to help select best solvers or solvers' configurations for specific problems and computer platforms. In both of these fields, high performance computing becomes an indispensable cross-cutting tool for achieving real-time solutions for big data problems. In this talk, we will show our recent developments in the intersection of these areas. "
Title: "Robust Mendelian Randomization coupled with Alphafold2 for drug target discovery"
When: March 29th, 2024
Zhonghua Liu
Assistant Professor in the Department of Biostatistics at Columbia University
Abstract: "Mendelian randomization (MR) uses genetic variants as instrumental variables (IVs) to infer the causal effect of a modifiable exposure on the outcome of interest by removing unmeasured confounding bias. However, some genetic variants might be invalid IVs due to violations of core IV assumptions. MR analysis with invalid IVs might lead to biased causal effect estimates and misleading scientific conclusions. To address this challenge, we propose a novel MR method that first Selects valid genetic IVs and then performs Post-selection Inference (MR-SPI) based on two-sample genome-wide summary statistics. We analyze 912 plasma proteins using the large-scale UK Biobank proteomics data in 54,306 participants and identify 7 proteins (TREM2, PILRB, PILRA, EPHA1, CD33, RET, CD55) significantly associated with the risk of Alzheimer’s disease. We employ AlphaFold2 to predict the 3D structural alterations of these 7 proteins due to missense genetic variations, providing new insights into their biological functions in disease etiology."
Title: "Microbes And Climate Change: Insights From A Grassland Experiment"
When: March 1st, 2024, 12 pm
Where: EES 100
Jizhong Zhou
George Lynn Cross Research Professor in School of Biological Sciences, Director of Institute for the Environmental Genomics, The University of Oklahoma
Abstract: "The acceleration of global climate warming, a consequence of the buildup of atmospheric CO2 and other greenhouse gases due to fossil fuel combustion and land use change, represents one of the greatest scientific and policy concerns in the 21st century. Understanding the mechanisms of biospheric feedbacks to climate change is critical to project future climate warming. Although microorganisms catalyze most biosphere processes related to fluxes of greenhouse gases, the roles of microorganisms in regulating future climate change remain elusive. With time-series data from a long-term climate change experiment in Oklahoma, our results showed that microorganisms play central roles in regulating soil carbon dynamics through three primary feedback mechanisms: climate warming stimulates microbial temporal turnovers and divergent succession, enhances network complexity and stability, but reduces microbial diversity. Our results also demonstrated that incorporating microbial community information significantly improves the predictability of global change models. All these results have important implications for modeling and predicting future climate change, as well as for policy-making."
Title: "Genetic prediction of disease risk across populations"
When: February 23rd, 2024
Hongyu Zhao
The Ira V. Hiscock Professor of Biostatistics at School of Public Health, Yale University
Abstract: "Polygenic risk score (PRS) has demonstrated its great utility in biomedical research through identifying high risk individuals for different diseases from their genotypes. However, the broader application of PRS to the general population is hindered by the limited transferability of PRS developed in Europeans to non-European populations. To improve PRS prediction accuracy in non-European populations, we develop Bayesian methods that can effectively integrate genome wide association study summary statistics from different populations. Our methods automatically adjust for linkage disequilibrium differences between populations, and characterize the joint distribution of the effect sizes of a variant in different populations as null, population specific, or shared with correlation. Through simulations and applications to real traits, we show that our methods improve the prediction performance over existing methods in non-European populations."
Title: "Bridging the Gap: From AI Research to Clinical Application in Medicine"
When: February 2nd, 2024
Steve B. Jiang
Barbara Crittenden Professor in Medical Artificial Intelligence and Automation Lab and Department of Radiation Oncology, University of Texas Southwestern Medical Center
Abstract: "This talk provides an extensive overview of the challenges and potential solutions in implementing AI in clinical medicine. It traces the transition from research to real-world application, focusing on critical aspects such as model generalizability, commissioning, and performance deterioration over time, etc. The presentation also addresses the integration of AI into existing medical workflows and the adaptation to different physician styles, highlighting the importance of a system-centric approach. Emphasizing the need for real-world adaptability and clinician engagement, this talk aims to shed light on developing AI tools that are not only technologically advanced but also seamlessly integrated and functional in diverse clinical settings."
Title: "BART: The Remarkable Flexibility of a Bayesian Ensemble of Trees"
When: January 26th, 2024
Edward I. George
Universal Furniture Professor Emeritus of Statistics and Data Science, The Wharton School, University of Pennsylvania
Abstract: "Motivated by ensemble methods in general, and gradient boosting in particular, BART (Bayesian Additive Regression Trees) is a Bayesian nonparametric regression approach for the discovery of the underlying relationship between Y and a multidimensional vector x. Approximating the conditional mean E[Y|x] with a sum of regression trees, BART is built on a statistical model: a likelihood combined with a prior that regularizes the trees to be dimensionally adaptive weak learners. Fitting and inference are accomplished with rapidly mixing Bayesian backfitting MCMC algorithms that enable full posterior inference, including point and interval estimates of E[Y|x], as well as model-free variable selection. To further illustrate the modeling flexibility of a Bayesian ensemble of trees, we also consider two BART elaborations: MBART and HBART. Exploiting potential monotonicities of E[Y|x], MBART incorporates a basis of multivariate monotone trees, thereby enabling the discovery and estimation of decompositions of the directions of E[Y|x] into their unique monotone increasing and decreasing components. To detect and mitigate the possible presence of heteroscedasticity, HBART incorporates an additional product-of-trees model component for the conditional variance, thereby enabling simultaneous inference about both E[Y|x] and Var[Y|x]. (This is joint research with H. Chipman, M. Pratola, R. McCulloch and T. Shively)."
FALL 2023
Title: "pan-MHC and cross-Species Prediction of T Cell Receptor-Antigen Binding with pMTnet-omni"
When: December 1st, 2023
Tao Wang
Associate Professor, UT Southwestern Medical Center
Abstract: "Profiling the binding of T cell receptor (TCR) of T cells towards antigenic peptides presented by MHC proteins is one of the most important unsolved problems in modern immunology. Traditional experimental methods to probe TCR-antigen interactions are slow, labor-intensive, costly, and low- to middle-throughput. To address this problem, we developed pMTnet-omni, an Artificial Intelligence (AI) system based on hybrid protein sequence and structure information, to predict the pairing of TCRs of αβ T cells with peptide-MHC complexes (pMHCs). pMTnet-omni is capable of handling peptides presented by both class I and II pMHCs, and capable of handling both human and mouse TCR-pMHC pairs. pMTnet-omni achieves a high overall Area Under the Curve of Receiver Operator Characteristics (AUROC) of 0.89, which surpasses competing tools by a large margin. We showed that pMTnet-omni can distinguish binding affinity of TCRs with similar sequences. Across a range of datasets from various biological contexts, pMTnet-omni characterized the longitudinal evolution and spatial heterogeneity of TCR-pMHC interactions and their functional impact. We successfully built a biomarker based on pMTnet-omni for predicting immune-related adverse events of immune checkpoint inhibitor (ICI) treatment in a cohort of 57 ICI-treated patients. pMTnet-omni represents a large step closer to a clinically usable AI system for TCR-pMHC pairing prediction that can aid the design and implementation of TCR-based immunotherapeutics."
Title: "Recasting Computer Science Problems in Data Science"
When: November 10th, 2023
Suku Nair
Vice Provost for Research & Chief Innovation Officer; Director, AT&T Center for Virtualization, Southern Methodist University
Abstract: "One of my favorite quotes in the cyber world is “In this day and age, either you are going to touch a computer or you are going to be touched by it”. A corollary in the data world is “Either you are going to produce data or you are going to consume it”. There lies the intertwined worlds of computer science and data science. However, in spite of obvious overlaps, fundamental objectives and methodologies in computer science and data science differ significantly. Computer science is built on universal principles to study and solve problems in computation, algorithms, and abstraction. The common approach is to create general-purpose solutions through algorithmic design(s). In contrast, data science is more empirical in nature, founded on data-driven exploration, statistical analysis, and machine learning for understanding and solving problems. With the availability of large amounts of data in most domains and tremendous advancements in computing capabilities, several problems in computer science have been recast as data science problems for more efficient solutions. In this talk we will present data-driven solution transitions for some of the problems in our research domains including Cyber Security, System Reliability, Computer and telecom networks, and Human Machine Interfaces."
Title: "Doubly Flexible Estimation under Label Shift"
When: November 17th, 2023
Yanyuan Ma
Professor, Pennsylvania State University
Abstract: "In studies ranging from clinical medicine to policy research, complete data are usually available from a population P, but the quantity of interest is often sought for a related but different population Q which only has partial data. In this paper, we consider the setting that both outcome Y and covariate X are available from P whereas only X is available from Q, under the so-called label shift assumption, i.e., the conditional distribution of X given Y remains the same across the two populations. To estimate the parameter of interest in population Q via leveraging the information from population P, the following three ingredients are essential: (a) the common conditional distribution of X given Y, (b) the regression model of Y given X in population P, and (c) the density ratio of the outcome Y between the two populations. We propose an estimation procedure that only needs some standard nonparametric regression technique to approximate the conditional expectations with respect to (a), while by no means needs an estimate or model for (b) or (c); i.e., doubly flexible to the possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that double robustness allows at most one model to be mis-specified whereas our proposal here can allow both (b) and (c) to be mis-specified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the Y-data in population Q. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database."
Title: "Omics and data science in public health"
When: October 20th, 2023, 12pm-1pm
Where: EES 100
Andrea Baccarelli MD, PhD
Leon Hess Professor and Chair, Department of Environmental Health Sciences, Columbia University
Abstract: "Dr. Baccarelli will present methods and results from recent and ongoing studies using molecular biology approaches coupled to data science to identify individuals that are more impacted or susceptible to harmful environmental exposures. He will introduce a cadre of methods ranging from epigenome-wide DNA methylation to exosome/extracellular vesicles as promising new paths to enhance understanding of the effects of environmental exposures on human health. Over the past few years, the application of contemporary machine learning methods to epigenomics, specifically to DNA methylation data, has shown that DNA methylation can provide accurate fingerprints of environmental factors, including tobacco smoking, environmental chemicals, and lifestyle. Those fingerprints reflect current exposure, but they also correlate well with past and cumulative exposure. Many investigators have compared the epigenome to a recording device built in our cells that captures both external and internal conditions. Using this framework provides untapped opportunities to identify the impact of risk factors at the individual level, as well as new approaches for risk stratification and personalized prevention. In this presentation, Dr. Baccarelli will review current evidence from recent studies and potential contributions to human health and disease. He will discuss data sources, methodological challenges for large human studies, limitations, and possible future directions."
Title: "Cancer prognosis analysis via integrating molecular and histopathological imaging features"
When: September 22nd, 2023
Shuangge Ma, Ph.D.
Professor and Chair, Biostatistics Department, Yale University School of Public Health
Abstract: "Modeling cancer prognosis is a “classic” yet still challenging problem. In the past two decades, high-throughput molecular data have been extensively used in such analysis. Very recently, it has been shown that histopathological imaging features, which are generated in the biopsy process, are also informative for modeling prognosis (and other outcomes/phenotypes). Molecular and imaging data contain overlapping as well as independent information. In our recent studies, we have developed regularization techniques, testing the degree of independent information for prognosis and integrating the two distinct types of data for prognosis modeling under homogeneity as well as heterogeneity."