## January 19, 2022

### Bailey Fosdick, Associate Professor of Statistics at Colorado State University

Click here to view a recording of this seminar

Title: Modeling Infection Fatality Rates to Assess the Burden of COVID-19 in Developing Countries

Abstract: COVID-19 spread quickly around the world after first being discovered in China in late 2019. It has had devastating impacts; however, these impacts, both in terms of infection prevalence and fatalities, have been distributed non-uniformly worldwide. While early studies focused on COVID-19 infection and fatality rates in high-income countries, little attention has been given to the impacts of COVID-19 in developing countries. In this work, we systematically reviewed the literature to identify all COVID-19 serology studies conducted by early 2021 using population-representative samples. We developed a Bayesian hierarchical model for simultaneously modeling serology and death data to make inference on age-specific infection fatality rates. This model directly accounts for conventional sampling uncertainty as well as uncertainty about the serological test assay's sensitivity and specificity. Through a careful analysis of data from over thirty developing countries, we found that seroprevalence in many developing-country locations was markedly higher than in high-income countries, and that age-specific infection fatality rates were roughly twice as high as those in high-income countries.
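
The sensitivity/specificity adjustment at the heart of such serology analyses can be illustrated with the classical Rogan-Gladen correction. This is a deliberately simplified point-estimate sketch, not the speaker's Bayesian hierarchical model:

```python
def adjusted_prevalence(raw_positive_rate, sensitivity, specificity):
    """Rogan-Gladen correction: recover true prevalence from an imperfect test.

    The test satisfies E[raw] = sens * p + (1 - spec) * (1 - p),
    so p = (raw - (1 - spec)) / (sens + spec - 1).
    """
    p = (raw_positive_rate - (1.0 - specificity)) / (sensitivity + specificity - 1.0)
    return min(max(p, 0.0), 1.0)  # clamp to [0, 1]
```

For example, a 22% raw positive rate under 90% sensitivity and 95% specificity corresponds to a true seroprevalence of 20%; the Bayesian model additionally propagates the uncertainty in all three inputs.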

## January 26, 2022

### Yanxun Xu, Assistant Professor of Applied Mathematics and Statistics at Johns Hopkins University

Click here to view a recording of this seminar

Title: A Bayesian Reinforcement Learning Framework for Optimizing Sequential Combination Antiretroviral Therapy in People with HIV

Abstract: Numerous adverse effects (e.g., depression) have been reported for combination antiretroviral therapy (cART) despite its remarkable success in viral suppression in people with HIV (PWH). To improve long-term health outcomes for PWH, there is an urgent need to design personalized optimal cART with the lowest risk of comorbidity in the emerging field of precision medicine for HIV. Large-scale HIV studies offer researchers unprecedented opportunities to optimize personalized cART in a data-driven manner. However, the large number of possible drug combinations for cART makes the estimation of cART effects a high-dimensional combinatorial problem, imposing challenges in both statistical inference and decision-making. We develop a two-step Bayesian decision framework for optimizing sequential cART assignments. In the first step, we propose a dynamic model for individuals' longitudinal observations using a multivariate Gaussian process. In the second step, we build a probabilistic generative model for cART assignments and design an uncertainty-penalized policy optimization using the uncertainty quantification from the first step. Applying the proposed method to a dataset from the Women's Interagency HIV Study, we demonstrate its clinical utility in assisting physicians to make effective treatment decisions, serving the purpose of both viral suppression and comorbidity risk reduction.
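
The uncertainty-penalized idea in the second step can be caricatured in a few lines: among candidate treatments, prefer the one whose predicted benefit stays high after subtracting a penalty proportional to predictive uncertainty. The function below is a hypothetical illustration of that pessimism principle, not the paper's actual policy optimization:

```python
def uncertainty_penalized_choice(pred_means, pred_stds, lam):
    """Pick the action maximizing (predicted payoff - lam * predictive std).

    pred_means, pred_stds: per-action predictions, e.g. from a Gaussian process.
    lam: how strongly to penalize uncertain actions (lam = 0 is greedy).
    """
    scores = [m - lam * s for m, s in zip(pred_means, pred_stds)]
    return max(range(len(scores)), key=lambda i: scores[i])
```

With a positive penalty, a slightly worse but well-understood treatment can beat a nominally better one whose effect estimate is very uncertain.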

## February 2, 2022

### Mengyang Gu, Assistant Professor of Statistics and Applied Probability at University of California, Santa Barbara

Click here to view a recording of this seminar

Title: Scalable marginalization of latent variables for correlated data—the legacy of Rudolf Kalman

Abstract: Computer models, such as numerical solutions of differential equations, are widely used in forward predictions, but their computing cost prohibits the use of computer experiments for some large-scale systems. In this talk, we will discuss the Gaussian process emulator for approximating computer models with massive coordinates, high-dimensional inputs, and functionals. Applications include emulating the TITAN2D model of pyroclastic flows, ground deformation simulation by COMSOL Multiphysics, and ab initio molecular dynamics simulations by density functional theory. For Gaussian processes with large numbers of observations, we will discuss marginalization of latent states. As an example, we will introduce the stochastic differential equation (SDE) representation of a Gaussian process with Matérn covariance for 1D inputs, and compute it by Kalman filter in a number of operations linear in the number of observations, as an exact, computationally efficient alternative. We further introduce the generalized probabilistic principal component analysis for matrix observations (such as images with missing values) based on the SDE representation for massive observations. If time permits, we will briefly talk about inverse problems based on observations such as microscopic videos and satellite radar interferograms.
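
The Matérn-1/2 (Ornstein-Uhlenbeck) case makes the Kalman connection concrete: the GP marginal likelihood computed by an O(n) filter agrees exactly with the O(n^3) dense-covariance computation. A minimal sketch, written by us for illustration rather than taken from the speaker's software:

```python
import numpy as np

def ou_kalman_loglik(t, y, sigma2, ell, noise2):
    """Exact GP log-likelihood for a Matern-1/2 (OU) kernel in O(n) via Kalman filtering.

    State-space form: x_k = phi * x_{k-1} + w_k with phi = exp(-dt/ell),
    Var(w_k) = sigma2 * (1 - phi^2); observations y_k = x_k + N(0, noise2).
    """
    m, P = 0.0, sigma2              # stationary prior for the latent state
    ll = 0.0
    for k in range(len(t)):
        if k > 0:
            phi = np.exp(-(t[k] - t[k - 1]) / ell)
            m = phi * m
            P = phi * phi * P + sigma2 * (1.0 - phi * phi)
        S = P + noise2              # innovation variance
        v = y[k] - m                # innovation
        ll += -0.5 * (np.log(2.0 * np.pi * S) + v * v / S)
        K = P / S                   # Kalman gain
        m = m + K * v
        P = (1.0 - K) * P
    return ll

def ou_dense_loglik(t, y, sigma2, ell, noise2):
    """Reference O(n^3) GP log-likelihood with the dense OU covariance matrix."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    C = sigma2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell) + noise2 * np.eye(len(t))
    _, logdet = np.linalg.slogdet(C)
    alpha = np.linalg.solve(C, y)
    return -0.5 * (len(t) * np.log(2.0 * np.pi) + logdet + y @ alpha)
```

The two functions return the same value for any inputs; only the first scales linearly with the number of observations.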

## February 9, 2022

### Yuexiao Dong, Associate Professor of Statistical Science at Temple University

Title: Testing the linear mean and constant variance conditions in sufficient dimension reduction

Abstract: Sufficient dimension reduction (SDR) methods characterize the relationship between the response and the predictors through a few linear combinations of the predictors. Sliced inverse regression and sliced average variance estimation are among the most popular SDR methods as they do not involve multi-dimensional smoothing and are easy to implement. However, these inverse regression-based methods require the linear conditional mean (LCM) and/or the constant conditional variance (CCV) assumption. We propose novel tests to check the validity of the LCM and the CCV conditions through the martingale difference divergence. Extensive simulation studies and a real data application are performed to demonstrate the effectiveness of our proposed tests.
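
As background, the sample martingale difference divergence of Shao and Zhang, the building block for such tests, can be computed directly from its definition; the actual test statistics in the talk differ in their details:

```python
import numpy as np

def mdd_squared(x, y):
    """Sample martingale difference divergence MDD(Y|X)^2 for scalar x, y:

        -(1/n^2) * sum_{j,k} (Y_j - Ybar)(Y_k - Ybar) |X_j - X_k|.

    It is zero when E[Y|X] is constant and positive otherwise (in the limit),
    since Euclidean distance matrices are conditionally negative definite.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    c = y - y.mean()
    D = np.abs(x[:, None] - x[None, :])
    return -(c @ D @ c) / len(x) ** 2
```

For x = (0, 1, 2, 3) and y = x^2, the conditional mean clearly varies with x and the statistic is strictly positive; with a constant y it is exactly zero.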

## February 16, 2022

### Natallia Katenka, Associate Professor of Computer Science and Statistics at University of Rhode Island

Click here to view a recording of this seminar

Title: Estimating causal effects of non-randomized HIV prevention interventions with spillover in network-based studies among people who inject drugs

Abstract: Evaluating causal effects in the presence of interference is challenging in network-based studies of hard-to-reach populations. Like many such populations, people who inject drugs (PWID) are embedded in social networks and often exert influence on others in their networks. In our setting, the study design is observational, with a non-randomized network-based HIV prevention intervention. Information is available on each participant and their connections that confer possible shared HIV risk through injection and sexual risk behaviors. We consider two inverse probability weighted (IPW) estimators to quantify the population-level effects of non-randomized interventions on subsequent health outcomes. We demonstrate that these two IPW estimators are consistent and asymptotically normal, and we derive a closed-form estimator for the asymptotic variance, while allowing for overlapping interference sets (groups of individuals in which interference is assumed possible). A simulation study was conducted to evaluate the finite-sample performance of the estimators. We analyzed data from the Transmission Reduction Intervention Project, which ascertained a network of PWID and their contacts in Athens, Greece, from 2013 to 2015. We evaluated the effects of community alerts on HIV risk behavior in this observed network, where links between participants were defined by using substances or having unprotected sex together. In the study, community alerts were distributed to inform people of recent HIV infections among individuals in close proximity in the observed network. The estimates of the risk differences for both IPW estimators demonstrated a protective effect. The results suggest that HIV risk behavior can be mitigated by exposure to a community alert when an increased risk of HIV is detected in the network.
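
For intuition, a basic Hajek-style IPW risk difference, which ignores the network interference structure that the talk's estimators are built to handle, looks like this:

```python
def ipw_risk_difference(a, y, e):
    """Hajek-style IPW contrast: weighted mean outcome under exposure minus under no exposure.

    a: 0/1 exposure indicators; y: outcomes; e: estimated exposure probabilities
    (propensity scores). Weighting by 1/e and 1/(1-e) corrects for non-random exposure.
    """
    w1 = [ai / ei for ai, ei in zip(a, e)]
    w0 = [(1 - ai) / (1 - ei) for ai, ei in zip(a, e)]
    mu1 = sum(wi * yi for wi, yi in zip(w1, y)) / sum(w1)
    mu0 = sum(wi * yi for wi, yi in zip(w0, y)) / sum(w0)
    return mu1 - mu0
```

The estimators in the talk generalize this idea to neighborhood-level exposures, with overlapping interference sets and a closed-form variance.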

## February 23, 2022

### Weining Shen, Associate Professor of Statistics at University of California, Irvine

Click here to view a recording of this seminar

Title: Covariance estimation for matrix data analysis

Abstract: Matrix-valued data have received increasing interest in applications such as neuroscience, environmental studies, and sports analytics. In this talk, I will discuss a recent project on estimating the covariance of matrix data. Unlike previous works that rely heavily on the matrix normal distribution assumption and the requirement of fixed matrix size, I will introduce a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. Computational algorithms, theoretical results, and applications will be discussed.
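
Under a separable (Kronecker) covariance, simple moment estimators recover the row and column covariance factors up to scale. This is a toy sketch of the separability idea only, not the regularized bandable estimators of the talk:

```python
import numpy as np

def separable_covariance(samples):
    """Moment estimators of row/column covariance factors under a separable model.

    samples: array of shape (n, p, q), mean-zero p x q matrix observations.
    If cov(vec X) = A (x) B (Kronecker product, A: q x q columns, B: p x p rows),
    then E[X X^T] is proportional to B and E[X^T X] is proportional to A.
    """
    X = np.asarray(samples, float)
    n, p, q = X.shape
    B = np.einsum('nij,nkj->ik', X, X) / (n * q)   # row covariance, up to scale
    A = np.einsum('nij,nik->jk', X, X) / (n * p)   # column covariance, up to scale
    return B, A
```

For i.i.d. standard normal entries both factors are (up to sampling noise) identity matrices; the separability assumption reduces a pq x pq covariance to a p x p and a q x q factor.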

## March 2, 2022

### Miles Lopes, Associate Professor of Statistics at University of California, Davis

Click here to view a recording of this seminar

Title: Rates of Approximation for CLT and Bootstrap in High Dimensions

Abstract: In the setting of low-dimensional data, it is well known that the distribution of a sample mean can be consistently approximated using the CLT or bootstrap methods. Also, classical versions of the Berry-Esseen theorem show that such approximations can achieve a rate of order n^{-1/2}, where accuracy is measured with respect to the "Kolmogorov distance". However, until recently, it was an open problem to determine if Berry-Esseen type bounds with near n^{-1/2} rates can be established in the context of high-dimensional data, which stimulated many advances in the literature during the last several years. In this talk, I will survey these developments and discuss some of my own recent work on this problem. The relevant papers for the talk are available at the following links: https://arxiv.org/abs/2009.06004 and https://doi.org/10.1214/19-AOS1844.
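
The objects in question are easy to simulate: draw bootstrap replicates of sqrt(n)(mean(x*) - mean(x)) and measure their Kolmogorov distance to the Gaussian limit. A small illustration (all names here are ours, not from the papers):

```python
import numpy as np

def kolmogorov_distance(sample_a, sample_b):
    """Max gap between two empirical CDFs, evaluated over the pooled sample."""
    grid = np.sort(np.concatenate([sample_a, sample_b]))
    Fa = np.searchsorted(np.sort(sample_a), grid, side="right") / len(sample_a)
    Fb = np.searchsorted(np.sort(sample_b), grid, side="right") / len(sample_b)
    return np.max(np.abs(Fa - Fb))

def bootstrap_mean_distribution(x, n_boot=2000, rng=None):
    """Bootstrap draws of sqrt(n) * (mean(x*) - mean(x)), resampling x with replacement."""
    rng = np.random.default_rng(rng)
    n = len(x)
    idx = rng.integers(0, n, size=(n_boot, n))
    return np.sqrt(n) * (x[idx].mean(axis=1) - x.mean())
```

For, say, 500 unit-exponential observations, the bootstrap draws are close in Kolmogorov distance to the N(0, 1) limit, and the Berry-Esseen question is precisely how fast that distance shrinks with n, especially when the data are high-dimensional.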

## March 16, 2022

### Qiyang Han, Assistant Professor of Statistics at Rutgers University

Click here to view a recording of this seminar

Title: High dimensional asymptotics of likelihood ratio tests in the Gaussian sequence model under convex constraints

Abstract: In the Gaussian sequence model $Y=\mu+\xi$, we study the likelihood ratio test (LRT) for testing $H_0: \mu=\mu_0$ versus $H_1: \mu \in K$, where $\mu_0 \in K$, and $K$ is a closed convex set in $\mathbb{R}^n$. In particular, we show that under the null hypothesis, normal approximation holds for the log-likelihood ratio statistic for a general pair $(\mu_0,K)$, in the high dimensional regime where the estimation error of the associated least squares estimator diverges in an appropriate sense. The normal approximation further leads to a precise characterization of the power behavior of the LRT in the high dimensional regime. These characterizations show that the power behavior of the LRT is in general non-uniform with respect to the Euclidean metric, and illustrate the conservative nature of existing minimax optimality and sub-optimality results for the LRT. A variety of examples, including testing in the orthant/circular cone, isotonic regression, Lasso, and testing parametric assumptions versus shape-constrained alternatives, are worked out to demonstrate the versatility of the developed theory.

This talk is based on joint work with Bodhisattva Sen (Columbia) and Yandi Shen (Chicago).

## March 23, 2022

### Hui Zou, Professor of Statistics at University of Minnesota

Click here to view a recording of this seminar

Title: Sparse Convoluted Rank Regression in High Dimensions

Abstract: Wang et al. (2020, JASA) studied high-dimensional sparse penalized rank regression and established its nice theoretical properties. Compared with least squares, rank regression can have a substantial gain in estimation efficiency while maintaining a minimal relative efficiency of $86.4\%$. However, the computation of penalized rank regression can be very challenging for high-dimensional data, due to the highly nonsmooth rank regression loss. In this work, we view the rank regression loss as a non-smooth empirical counterpart of a population-level quantity, and a smooth empirical counterpart is derived by substituting a kernel density estimator for the true distribution in the expectation calculation. This view leads to the convoluted rank regression loss and, consequently, the sparse penalized convoluted rank regression (CRR) for high-dimensional data. Under the same key assumptions as for sparse rank regression, we establish the rate of convergence of the $\ell_1$-penalized CRR for a tuning-free penalization parameter and prove the strong oracle property of the folded concave penalized CRR. We further propose a high-dimensional Bayesian information criterion for selecting the penalization parameter in folded concave penalized CRR and prove its selection consistency. We derive an efficient algorithm for solving sparse convoluted rank regression that scales well with high dimensions. Numerical examples demonstrate the promising performance of the sparse convoluted rank regression over the sparse rank regression. Our theoretical and numerical results suggest that sparse convoluted rank regression enjoys the best of both sparse least squares regression and sparse rank regression.
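
The smoothing idea can be made concrete with a Gaussian kernel: convolving the absolute-value loss with N(0, h^2) has the closed form E|u + hZ| = u(2*Phi(u/h) - 1) + 2h*phi(u/h), which recovers |u| as h goes to 0. A sketch under that Gaussian-kernel assumption (the paper's kernel choice and construction may differ):

```python
import math

def smoothed_abs(u, h):
    """E|u + h Z| with Z ~ N(0, 1): the absolute-value loss convolved with a Gaussian kernel."""
    z = u / h
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    return u * (2.0 * Phi - 1.0) + 2.0 * h * phi

def convoluted_rank_loss(residuals, h):
    """Smoothed rank-regression loss: average smoothed |e_i - e_j| over all ordered pairs."""
    n = len(residuals)
    total = sum(smoothed_abs(residuals[i] - residuals[j], h)
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

Unlike the raw rank loss, this surrogate is everywhere differentiable in the residuals, which is what makes efficient high-dimensional optimization feasible.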

## March 30, 2022

### Miaoyan Wang, Assistant Professor of Statistics at University of Wisconsin

Click here to view a recording of this seminar

Title: Nonparametric Tensor Completion via Sign Series

Abstract: Higher-order tensors arise frequently in applications such as neuroimaging, recommender systems, social network analysis, and psychological studies. We consider the problem of tensor estimation from noisy observations with possibly missing entries. A nonparametric approach to tensor completion is developed based on a new model that we coin sign-representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low- and high-rank signals, while encompassing many existing tensor models, including CP models, Tucker models, single index models, and several hypergraphon models, as special cases. We show that the sign tensor series is theoretically characterized, and computationally estimable, via classification tasks with carefully specified weights. Excess risk bounds, estimation error rates, and sample complexities are established. We demonstrate that our approach outperforms previous methods on two datasets, one on human brain connectivity networks and the other on NeurIPS topic data mining.
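
The layer-cake identity behind a sign series is simple for entries in [0, 1]: theta equals the integral over pi of 1{theta > pi}, so averaging thresholded sign tensors over a grid of levels recovers the signal. A toy sketch of the representation only (estimating each sign tensor from noisy, partially observed data, the hard part, is omitted):

```python
import numpy as np

def sign_series_approx(theta, levels):
    """Represent entries of theta (in [0, 1]) as an average of thresholded sign tensors:

        theta ~= mean over h of 1{theta > pi_h}

    for a grid of levels pi_h. Each 1{theta > pi_h} is a structured sign tensor.
    """
    return np.mean([(theta > pi).astype(float) for pi in levels], axis=0)
```

Each layer is a classification target, which is why the estimation problem reduces to a series of weighted classification tasks.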

## April 6, 2022

### Kris Sankaran, Assistant Professor of Statistics at University of Wisconsin-Madison

Click here to view a recording of this seminar

Title: Revisiting Iterative Data Structuration: Alignment, Refinement, and Simulation

Abstract: There are many sources of variation in biological data. In a typical study, a statistician may be expected to sort through real variation across subjects, timepoints, environments, and sequencing technologies, not to mention nuisance variation due to batch effects and technical artifacts. To manage such an analysis, it is helpful to break the task into smaller, more manageable pieces, beginning with simple, transparent models, without losing sight of the need for models faithful to reality, as complex as it may be.

With this in mind, this talk revisits the idea of iterative data structuration [1, 2] from the lens of modern data visualization and generative modeling. We dive deeply into the question of choosing K in Latent Dirichlet Allocation, a popular dimensionality reduction strategy for count data in ‘omics. We develop an approach to “topic alignment,” which makes it easy to jump across models with differing K in a way that parallels hierarchical clustering [3]. Our construction suggests several natural diagnostics for quantifying topic quality; these are evaluated in a simulation study. We also demonstrate the use of an accompanying R package, alto (https://lasy.github.io/alto), for analysis of a vaginal microbiome dataset.
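
A stripped-down version of topic matching across neighboring values of K, here by cosine similarity of topic-word distributions, conveys the flavor; the alignment weights in the alto package are more refined than this sketch:

```python
import numpy as np

def align_topics(beta_small, beta_large):
    """Match each topic of a K-topic model to its nearest topic in a larger model.

    beta_small: (K, V) topic-word distributions; beta_large: (K', V) with K' > K.
    Returns, for each small-model topic, the index of the most similar
    large-model topic and the cosine similarity of the match.
    """
    a = beta_small / np.linalg.norm(beta_small, axis=1, keepdims=True)
    b = beta_large / np.linalg.norm(beta_large, axis=1, keepdims=True)
    sim = a @ b.T
    return sim.argmax(axis=1), sim.max(axis=1)
```

Chaining such matches across K = 2, 3, 4, ... yields the tree-like picture of topic refinement that parallels hierarchical clustering; topics whose best match is weak are natural candidates for being spurious.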

We then transition to a series of vignettes on how coupling data visualization and generation can support navigation across the space of models. We give examples from experimental design, mixture modeling, and agent-based simulation, highlighting the value of iterative data structuration as a problem-solving device.

[1] Holmes, Susan. "Comment on 'A Model for Studying Display Methods of Statistical Graphics'." Journal of Computational and Graphical Statistics 2.4 (1993): 349-353.

[2] Mallows, Colin L., and John W. Tukey. "An overview of techniques of data analysis, emphasizing its exploratory aspects." Some Recent Advances in Statistics 33 (1982): 111-172.

[3] Fukuyama, Julia, Kris Sankaran, and Laura Symul. "Multiscale Analysis of Count Data through Topic Alignment." arXiv preprint arXiv:2109.05541 (2021).

## April 13, 2022 (in-person and online)

### Marc Richards and Wei Peng, NFL Big Data Bowl finalists

Click here to view a recording of this seminar

Title: Applications of Player Tracking Data in the National Football League

Abstract: The National Football League, in partnership with Kaggle, hosts an annual competition called the Big Data Bowl that challenges participants to answer questions using player tracking data. This player tracking data is a rich spatio-temporal dataset that captures the location, orientation, and direction of each player on the field, and of the football, at every tenth of a second across all plays. Armed with these data, participants explore questions such as "What makes a good defense?" or "How can we better evaluate special teams?". In this talk, we will discuss this rich dataset and the questions one can answer with it. We will then discuss two applications of these data: (i) exploring new ways to evaluate individual defensive performance in pass coverage, and (ii) identifying optimal aiming locations for punters.

## April 20, 2022

### Cencheng Shen, Assistant Professor of Applied Economics and Statistics at the University of Delaware

Click here to view a recording of this seminar

Abstract: Graph data is high-dimensional and structured, which often requires proper dimension reduction prior to subsequent inference. In this talk, I will introduce a new method called graph encoder embedding. Compared to existing approaches, the encoder embedding is extremely fast and scalable, easy to visualize and interpret, and asymptotically consistent under popular random graph models. I will illustrate its properties via simulations and data applications on vertex classification and clustering.
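
The one-hot encoder embedding itself is strikingly simple: project the adjacency matrix onto class-size-normalized label indicators, so each vertex is summarized by its average connectivity to every class. A minimal sketch, assuming every class is non-empty:

```python
import numpy as np

def graph_encoder_embedding(A, labels, n_classes):
    """One-hot graph encoder embedding.

    A: (n, n) adjacency matrix; labels: length-n integer class labels.
    Returns Z = A @ W, where column k of W is the indicator of class k
    divided by the class size, so Z[v, k] is vertex v's average
    connectivity to class k.
    """
    n = A.shape[0]
    W = np.zeros((n, n_classes))
    for k in range(n_classes):
        members = (labels == k)
        W[members, k] = 1.0 / members.sum()
    return A @ W
```

The embedding costs a single sparse matrix product, which is what makes the method fast and scalable; when labels are unknown, one can iterate between embedding and clustering the rows of Z.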