# Fall 2021 Seminars

### David Choi, Assistant Professor of Statistics and Information Systems at Heinz College, Carnegie Mellon University

Title: Causal Inference for Randomized Experiments in Social Networks

Abstract: In experiments that study social phenomena, such as peer influence or herd immunity, the treatment of one unit may influence the outcomes of others. Such "interference between units" violates traditional approaches to causal inference, so additional assumptions are usually required to model the underlying social mechanism. We propose an approach that requires no such assumptions, allowing for social effects that are both unmodeled and strong, with confidence intervals found using only the randomization of treatment. Regression, matching, or weighting may be applied, as best fits the application at hand. Inference is done by bounding the distribution of the estimation error over all possible values of the unknown counterfactual, resulting in an Ising blockmodel problem that we solve or bound by integer program. Examples are shown using a vaccine trial and two experiments investigating social influence.
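The randomization-only flavor of inference can be illustrated with a basic Fisher randomization test on simulated data. This is a generic sketch of randomization inference under the sharp null, not the error-bounding Ising-blockmodel procedure the talk describes; all data and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n units, half randomly assigned to treatment, true effect of 1.0.
n = 100
treat = rng.permutation(np.array([1] * 50 + [0] * 50))
outcome = 1.0 * treat + rng.normal(size=n)

def diff_in_means(y, z):
    return y[z == 1].mean() - y[z == 0].mean()

observed = diff_in_means(outcome, treat)

# Randomization distribution under the sharp null of no effect:
# re-randomize the treatment labels many times and recompute the statistic.
null_stats = np.array([
    diff_in_means(outcome, rng.permutation(treat))
    for _ in range(2000)
])
p_value = np.mean(np.abs(null_stats) >= abs(observed))
```

The only source of randomness exploited here is the treatment assignment itself, which is the sense in which such intervals require no model of the social mechanism.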

## September 15, 2021

### Joshua Snoke, Associate Statistician with RAND Corporation

Title: The Statistical Underpinnings of Modern Data Privacy and Confidentiality

Abstract: Significant effort goes into collecting and utilizing social, economic, health, and technological data for research purposes, but often the ability to access or share these data is limited because they contain sensitive information concerning the identities or attributes of those in the data. In this talk, I will cover topics from the fields of statistical disclosure control (SDC) and formal privacy, which involve methods to alter the confidential data with the goal of minimizing the risk of disclosure while preserving the statistical accuracy of the data. I will provide examples from some of my methodological and applied work in areas of Differential Privacy and Synthetic Data and more broadly review the many statistical underpinnings of this field. I will also offer practical considerations for how using altered data products for statistical estimates can affect the results, specifically concerning bias and inequality.
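As a minimal illustration of the formal-privacy side of this field, the standard Laplace mechanism for a counting query adds noise scaled to the query's sensitivity divided by the privacy parameter epsilon. This is a textbook differential-privacy sketch with made-up data, not code from the speaker's work.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise calibrated to (sensitivity, epsilon)."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Counting query: how many records have a given attribute?
# Adding or removing one record changes a count by at most 1, so sensitivity = 1.
data = rng.integers(0, 2, size=1000)  # binary attribute for 1000 people
true_count = int(data.sum())

epsilon = 0.5  # smaller epsilon = stronger privacy, noisier released answer
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon, rng=rng)
```

Any downstream estimate computed from `private_count` inherits this noise, which is one concrete way altered data products introduce the bias concerns mentioned above.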

## September 22, 2021

### Zach Branson, Assistant Teaching Professor of Statistics at Carnegie Mellon University

Title: Randomization Tests for Assessing Covariate Balance When Designing and Analyzing Matched Datasets

Abstract: Causal analyses for observational studies are often complicated by covariate imbalances among treatment groups, and matching methodologies alleviate this complication by finding subsets of treatment groups that exhibit covariate balance. It is widely agreed that covariate balance can serve as evidence that a matched dataset approximates a randomized experiment, but what kind of experiment does it approximate? In this talk, I will present a randomization test for the hypothesis that a matched dataset approximates a particular experimental design, such as complete randomization, block randomization, or rerandomization. The test can incorporate any experimental design and allows for a graphical display that puts several designs on the same univariate scale, thereby allowing researchers to pinpoint which design, if any, is most appropriate for a matched dataset. After researchers determine a plausible design, we recommend a randomization-based analytical approach, which can incorporate any design and treatment effect estimator. In simulations and applications, I’ve found that this test can frequently detect violations of randomized assignment that harm inferential results, and also that matched datasets with high levels of balance tend to approximate balance-constrained designs like rerandomization, thereby allowing for precise causal analyses. However, assuming a precise design should be approached with caution, because it can harm inferential results if there are still substantial biases due to remaining imbalances after matching. Although I focus on matching, I also demonstrate how to use randomization tests to assess covariate balance in instrumental variable analyses and regression discontinuity designs. All of these tools can be implemented in my R package randChecks.
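A bare-bones version of this kind of balance check compares an observed Mahalanobis balance statistic to its distribution under complete randomization of the same treatment labels. The following is a simplified sketch on simulated data, not the `randChecks` implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "matched" dataset: 200 units with 3 covariates, half treated.
n, p = 200, 3
X = rng.normal(size=(n, p))
treat = rng.permutation(np.array([1] * 100 + [0] * 100))

def mahalanobis_balance(X, z):
    """Mahalanobis distance between treated and control covariate means."""
    diff = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    cov = np.cov(X, rowvar=False) * (1 / (z == 1).sum() + 1 / (z == 0).sum())
    return float(diff @ np.linalg.solve(cov, diff))

observed = mahalanobis_balance(X, treat)

# Reference distribution under the hypothesized design (complete randomization).
ref = np.array([mahalanobis_balance(X, rng.permutation(treat)) for _ in range(1000)])
p_value = np.mean(ref >= observed)  # small p-value: balance is worse than the design implies
```

Swapping the permutation step for draws from a block-randomized or rerandomized design gives the corresponding test for those designs, which is the unifying idea behind putting several designs on one scale.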

## September 29, 2021

### Yiru Wang, Assistant Professor of Economics at University of Pittsburgh

Title: A Unified Approach to Estimating Panel Autoregressive Models with Latent Group Structures (joint with Wenxin Huang)

Abstract: This paper develops unified estimation and inference for the panel autoregressive (AR) coefficient. The degree of persistence is unknown for each time series, but the AR coefficient is assumed to contain a latent group structure. We propose a penalized weighted least squares approach, a modified version of the Su et al. (2016) classifier-Lasso (C-Lasso), to simultaneously identify the group membership and consistently estimate the AR coefficient, regardless of whether the underlying AR process is stationary, unit root, near-integrated, or even explosive. Theoretically, we demonstrate the classification consistency and the oracle properties of the proposed Lasso-type estimators. Monte Carlo simulations demonstrate good finite-sample performance of the proposed approach in both classification and estimation.
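A heavily simplified stand-in for the two goals in the abstract (recover the latent group membership, then estimate group-level AR coefficients) is a naive two-step procedure on simulated stationary panels. The actual C-Lasso approach is a penalized joint estimator that also covers unit-root, near-integrated, and explosive regimes; this sketch, with made-up group values, does not.

```python
import numpy as np

rng = np.random.default_rng(3)

# Panel with a latent two-group structure in the AR(1) coefficient.
N, T = 40, 200
true_rho = np.where(np.arange(N) < 20, 0.3, 0.9)
y = np.zeros((N, T))
for t in range(1, T):
    y[:, t] = true_rho * y[:, t - 1] + rng.normal(size=N)

# Step 1: per-series OLS estimates of the AR(1) coefficient.
rho_hat = np.array([
    (y[i, 1:] @ y[i, :-1]) / (y[i, :-1] @ y[i, :-1]) for i in range(N)
])

# Step 2: crude grouping -- simply split at the overall median.
groups = (rho_hat > np.median(rho_hat)).astype(int)

# Step 3: pooled re-estimation of the AR coefficient within each group.
pooled = []
for g in (0, 1):
    idx = np.where(groups == g)[0]
    num = sum(y[i, 1:] @ y[i, :-1] for i in idx)
    den = sum(y[i, :-1] @ y[i, :-1] for i in idx)
    pooled.append(num / den)
```

Pooling within correctly identified groups is what delivers the oracle-type efficiency gain over the per-series estimates.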

## October 6, 2021

### Fangzheng Xie, Assistant Professor of Statistics at Indiana University

Title: Central limit theorems for spectral estimators and their one-step refinement for sparse random graphs

Abstract: Sparse random graph models have attracted intense interest in statistics and machine learning, as well as in a broad range of application areas. In this talk, I will establish the central limit theorems for the spectral estimators and their one-step refinement for low-rank random graphs. Spectral estimators, including the adjacency spectral embedding and the Laplacian spectral embedding, are proven to be sub-optimal and can be improved by their respective one-step refinements. These results are built upon a collection of central limit theorems for spectral methods and the one-step refinement. Simulation examples and the analysis of a real-world Wikipedia graph dataset are provided to demonstrate the usefulness of the proposed methods.

The talk is based on the following papers:

Xie F, Xu Y. Efficient estimation for random dot product graphs via a one-step procedure. Journal of the American Statistical Association, accepted for publication, 2021.
Xie F. Entrywise limit theorems of eigenvectors and their one-step refinement for sparse random graphs. arXiv preprint arXiv:2106.09840, 2021.
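For intuition, a toy adjacency spectral embedding of a rank-one random dot product graph can be computed directly from an eigendecomposition. This sketch uses simulated latent positions and omits the one-step refinement studied in the papers above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Rank-1 random dot product graph: edge probability P_ij = x_i * x_j.
n = 300
x = rng.uniform(0.3, 0.7, size=n)
P = np.outer(x, x)
A = (rng.uniform(size=(n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T  # symmetric, hollow adjacency matrix

# Adjacency spectral embedding with d = 1: scaled leading eigenvector.
vals, vecs = np.linalg.eigh(A)
lead = vecs[:, -1] * np.sqrt(vals[-1])
lead = lead if lead.sum() > 0 else -lead  # resolve the sign ambiguity

rmse = np.sqrt(np.mean((lead - x) ** 2))
```

The embedding recovers the latent positions up to sign, with entrywise error shrinking as the graph grows; the one-step refinement targets the remaining inefficiency of this estimator.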

## October 13, 2021 - via Zoom

### Emily Hector, Assistant Professor of Statistics at North Carolina State University

Title: Data integration meets divide-and-conquer: dealing with heterogeneity and dependence in big data

Abstract: Divide-and-conquer has become a routine algorithmic choice to deal with big data, but statistical theory for complex settings remains underdeveloped. We propose a framework to regress high-dimensional dependent outcomes on covariates in a statistically and computationally efficient way. Two primary challenges arise due to dependence and heterogeneity between outcomes. To address these challenges, we first assume the heterogeneity structure of the outcomes is known, and develop a data integration procedure for estimation and inference that is implemented in a divide-and-conquer computational scheme. We then further extend this approach to learn the heterogeneity structure of the outcomes when it is unknown. The approach is based on efficient estimation of sub-response specific regression parameters using quadratic inference functions, and joint re-estimation of parameters following some rule of heterogeneity and dependence. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We consider two applications to illustrate the proposed methodology: the analysis of the association between smoking and metabolites in a large multi-cohort study when the heterogeneity structure is known, and image-on-scalar regression in a large multi-cohort neuroimaging study when the heterogeneity structure is unknown.
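The combination step of a divide-and-conquer scheme can be caricatured with block-wise OLS estimates merged by inverse-variance weighting. This is a generic meta-analytic sketch on simulated data, not the quadratic-inference-function machinery of the talk.

```python
import numpy as np

rng = np.random.default_rng(5)

# A large regression problem with one slope shared across all blocks.
n_blocks, n_per, beta = 10, 500, 2.0
estimates, inv_vars = [], []
for _ in range(n_blocks):
    x = rng.normal(size=n_per)
    y = beta * x + rng.normal(size=n_per)
    b_hat = (x @ y) / (x @ x)                          # per-block OLS slope
    resid = y - b_hat * x
    var_hat = (resid @ resid) / (n_per - 1) / (x @ x)  # its estimated variance
    estimates.append(b_hat)
    inv_vars.append(1.0 / var_hat)

# Combine the block-level estimates by inverse-variance weighting.
estimates, inv_vars = np.array(estimates), np.array(inv_vars)
combined = (inv_vars * estimates).sum() / inv_vars.sum()
se_combined = inv_vars.sum() ** -0.5
```

Each block is cheap to fit and the blocks can be processed in parallel, which is the computational appeal; the statistical work in the talk lies in doing this efficiently when the outcomes are dependent and heterogeneous rather than i.i.d. as here.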

## October 20, 2021

### Subhadeep Paul, Assistant Professor of Statistics and Core Professor in Translational Data Analytics Institute at Ohio State University

Title: Modeling continuous-time networks of relational events

Abstract: Spatiotemporal data with complex network dependencies are increasingly available in many application problems involving human mobility, social media, disease transmission, and international relationships. In many such application settings, the observed data consist of timestamped relational events. For example, in social media, users interact with each other through events that occur at specific time instances, such as liking, mentioning, commenting, or sharing another user's content. In international relations and conflicts, nations commit acts of hostility or disputes through discrete time-stamped events. I will introduce statistical models and methods for analyzing such datasets, combining tools from network analysis and multivariate point processes. I will also describe scalable estimation methods and study the asymptotic properties of the estimators. Finally, I will demonstrate that the models fit several real datasets well and predict temporal motif structures in those datasets.

## October 27, 2021 - via Zoom

### Somak Dutta, Associate Professor of Statistics at Iowa State University

Title: On lattice-based approximation of fractional Gaussian fields

Abstract: Fractional Gaussian fields provide a rich class of spatial models and have a long history of applications in multiple branches of science. However, estimation and inference for fractional Gaussian fields present significant computational challenges. In this talk, we investigate the use of fractional Laplacian differencing on regular lattices to approximate continuum fractional Gaussian fields. We show that regular lattice approximations facilitate fast matrix-free computations and enable anisotropic representations, and demonstrate that there is considerable agreement between the continuum models and their lattice approximations for a range of the fractional parameter. Thus, parameter estimates and inferences about the continuum fractional Gaussian fields can be derived from the lattice approximations. We also develop matrix-free exact conditional simulations for the lattice-based model and illustrate our methods on surface temperatures of the Indian Ocean from the Argo floats project and groundwater arsenic mapping in Bangladesh. This talk is based on several joint works with Dr. Debashis Mondal.
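On a periodic lattice, a field of this general type can be sampled spectrally, since the discrete Laplacian is diagonalized by the FFT. The following sketch, with arbitrary parameter choices, applies a fractional power of `kappa^2 + Laplacian` to white noise; it illustrates the matrix-free flavor of lattice computations, not the paper's exact model or its conditional simulation algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)

# Sample an approximate fractional Gaussian field on a periodic n x n lattice.
n, alpha, kappa = 128, 1.5, 0.5

j = np.arange(n)
# Eigenvalues of the (negative) discrete Laplacian on the periodic lattice.
lam1 = 2.0 - 2.0 * np.cos(2 * np.pi * j / n)
lam = lam1[:, None] + lam1[None, :]

# Filter white noise by (kappa^2 + Laplacian)^(-alpha/2) in the Fourier domain.
white = rng.normal(size=(n, n))
spec = np.fft.fft2(white) * (kappa**2 + lam) ** (-alpha / 2.0)
field = np.real(np.fft.ifft2(spec))
```

Because only elementwise operations and FFTs are involved, each sample costs O(n^2 log n) with no matrix ever formed, which is the sense in which such lattice computations are matrix-free.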

## November 3, 2021 - via Zoom

### James D. Wilson, Assistant Professor of Psychiatry and Biostatistics, Director of Experimental Design and Data Analysis for the Translational Neuroscience Program at University of Pittsburgh

Title: Network Analysis of the Brain: Statistical Modeling of Functional Connectivity Data

Abstract: Network analysis is one of the prominent multivariate techniques used to study structural and functional connectivity of the brain. In a network model of the brain, vertices represent voxels or regions of the brain, and edges between two vertices represent a physical or functional relationship between the corresponding regions. Network investigations of connectivity have produced many important advances in our understanding of brain structure and function, including in domains of aging, learning and memory, cognitive control, emotion, and disease.

Despite their use, network methodologies still face several important challenges. In this talk, I will focus on a particularly important challenge in the analysis of structural and functional connectivity: how does one jointly model the generative mechanisms of structural and functional connectivity with other modalities? I propose and describe a statistical network model, called the generalized exponential random graph model (GERGM), that flexibly characterizes the network topology of structural and functional connectivity and can readily integrate other modalities of data. The GERGM also directly enables the statistical testing of individual differences through the comparison of their fitted models. In applying the GERGM to the connectivity of healthy individuals from the Human Connectome Project, we find that the GERGM reveals remarkably consistent organizational properties guiding subnetwork architecture in the typically developing brain. We will discuss ongoing work on adapting these models to neuroimaging cohorts associated with the ADRC at the University of Pittsburgh, where the goal is to relate the dynamics of structural and functional connectivity with tau and amyloid-beta deposition in individuals across the Alzheimer’s continuum.

## November 17, 2021 - via Zoom

### Chao Gao, Assistant Professor of Statistics at University of Chicago

Title: Exact Minimax Estimation for Phase Synchronization

Abstract (note that LaTeX has been converted to plain text): We study the phase synchronization problem with measurements $Y=z^*z^{*H}+\sigma W \in \mathbb{C}^{n\times n}$, where $z^*$ is an $n$-dimensional complex unit-modulus vector and $W$ is a complex-valued Gaussian random matrix. It is assumed that each entry $Y_{jk}$ is observed with probability $p$. We prove that the minimax lower bound of estimating $z^*$ under the squared $\ell_2$ loss is $(1-o(1))\frac{\sigma^2}{2p}$. We also show that both the generalized power method and the maximum likelihood estimator achieve the error bound $(1+o(1))\frac{\sigma^2}{2p}$. Thus, $\frac{\sigma^2}{2p}$ is the exact asymptotic minimax error of the problem. Our upper bound analysis involves a precise characterization of the statistical property of the power iteration. The lower bound is derived through an application of van Trees' inequality.
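A small simulation of the generalized power method under this model (taking $p=1$, i.e., a fully observed $Y$) is sketched below. The spectral initialization and entrywise normalization follow the usual recipe for this problem and are not necessarily the exact variant analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate Y = z z^H + sigma * W with a complex unit-modulus vector z.
n, sigma = 200, 1.0
theta = rng.uniform(0, 2 * np.pi, size=n)
z = np.exp(1j * theta)
G = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) / np.sqrt(2)
W = (G + G.conj().T) / np.sqrt(2)  # Hermitian Gaussian noise
Y = np.outer(z, z.conj()) + sigma * W

def entrywise_normalize(v):
    return v / np.abs(v)

# Generalized power method: multiply by Y, then project every entry
# back onto the complex unit circle. Initialize from the top eigenvector.
vals, vecs = np.linalg.eigh(Y)
z_hat = entrywise_normalize(vecs[:, -1])
for _ in range(50):
    z_hat = entrywise_normalize(Y @ z_hat)

# Squared l2 error after aligning the global phase rotation.
rotation = (z_hat.conj() @ z) / abs(z_hat.conj() @ z)
err = np.linalg.norm(z_hat * rotation - z) ** 2
```

With $p=1$ the asymptotic minimax error above is $\sigma^2/2$, so for this configuration `err` should land in that vicinity rather than grow with $n$.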

## December 1, 2021 - via Zoom

### Walter Dempsey, Assistant Professor of Biostatistics at University of Michigan, School of Public Health

Title: Statistical network modeling via exchangeable interaction processes

Abstract: Many modern network datasets arise from processes of interactions in a population, such as phone calls, e-mail exchanges, co-authorships, social network posts, and professional collaborations. In such interaction networks, the interactions comprise the fundamental statistical units, making a framework for interaction-labeled networks more appropriate for statistical analysis. In this talk, we present exchangeable interaction network models and explore their statistical properties. These models allow for sparsity and power law degree distributions, both of which are widely observed empirical network properties. I will start by presenting the simple Hollywood model, which is computationally tractable, admits a clear interpretation, exhibits good theoretical properties, and performs reasonably well in estimation and prediction.
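Simulating from the two-parameter Hollywood model can be sketched as follows: each slot of each interaction either reuses an existing vertex, with probability proportional to its discounted degree, or introduces a new vertex. The parameter values and fixed interaction size here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

def sample_hollywood(n_interactions, size, alpha, theta, rng):
    """Sketch of the two-parameter Hollywood model. With m slots filled and V
    vertices so far, a new slot picks a fresh vertex with probability
    (theta + alpha*V)/(theta + m), or existing vertex v with probability
    (degree_v - alpha)/(theta + m)."""
    degrees = []       # degrees[v] = number of slots vertex v has filled
    interactions = []
    m = 0              # total degree (slots filled) so far
    for _ in range(n_interactions):
        inter = []
        for _ in range(size):
            V = len(degrees)
            p_new = (theta + alpha * V) / (theta + m)
            if rng.uniform() < p_new:
                degrees.append(1)
                inter.append(V)
            else:
                probs = (np.array(degrees) - alpha) / (m - alpha * V)
                v = rng.choice(V, p=probs)
                degrees[v] += 1
                inter.append(v)
            m += 1
        interactions.append(inter)
    return interactions, np.array(degrees)

interactions, degrees = sample_hollywood(2000, size=2, alpha=0.5, theta=1.0, rng=rng)
```

The rich-get-richer reuse rule produces heavy-tailed degrees, while the steadily arriving new vertices produce sparsity, matching the two empirical properties highlighted in the abstract.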

In many settings, the series of interactions exhibit additional structure.  E-mail exchanges, for example, have a single sender and potentially multiple receivers. User posts on a social network occur over time and potentially exhibit community structure.  I will briefly introduce three extensions that fall within the edge exchangeable framework. In particular, I will introduce extensions of the Hollywood model (1) that partially pools information via a latent, shared population-level distribution to account for hierarchical structure; (2) that accounts for temporal information; and (3) that accounts for latent community structure. Simulation studies and supporting theoretical analyses are presented. Computationally tractable MCMC sampling algorithms are derived. Inferences are shown on the Enron e-mail, ArXiv, and TalkLife (peer support network) datasets.  I will end with a discussion of how to perform posterior predictive checks on interaction data. Using these proposed checks, I will show that the edge exchangeable framework leads to models that fit interaction datasets well.

## December 8, 2021 - via Zoom

### Keith Levin, Assistant Professor of Statistics at University of Wisconsin, Madison

Title: Averaging Connectomes: Beyond the Arithmetic Mean

Abstract: Data arising from neuroimaging studies often consists of a collection of sample covariance matrices describing the dependence among blood oxygen level signals measured at different locations in the brain. A natural approach to population-level inference based on these samples is to consider the arithmetic mean of these matrices. However, the nature of the data under study suggests that this is a suboptimal choice, as the arithmetic mean fails to account for the structure of the positive definite cone. Even in the absence of covariance structure, the observed matrices may differ in their noise structures owing to subject-level factors (e.g., head movement in the MRI machine), in which case a weighted average is more appropriate. In this talk, we will discuss both of these settings and present alternative choices of matrix averages better suited to them than the arithmetic mean. We will demonstrate some of these techniques in an application to fMRI data.
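One standard alternative to the arithmetic mean that respects the positive definite cone is the log-Euclidean mean, which averages the matrix logarithms and maps back. The sketch below applies it to simulated sample covariance matrices standing in for connectomes; it is one of several possible matrix averages, not necessarily the one emphasized in the talk.

```python
import numpy as np

rng = np.random.default_rng(9)

def sym_logm(S):
    """Matrix logarithm of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.log(vals)) @ vecs.T

def sym_expm(S):
    """Matrix exponential of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.exp(vals)) @ vecs.T

def log_euclidean_mean(mats):
    """Average SPD matrices on the log scale, staying inside the SPD cone."""
    return sym_expm(np.mean([sym_logm(S) for S in mats], axis=0))

# Toy "connectomes": sample covariance matrices from one population covariance.
p, n_subj, n_obs = 5, 20, 100
true_cov = np.eye(p) + 0.3
mats = []
for _ in range(n_subj):
    X = rng.multivariate_normal(np.zeros(p), true_cov, size=n_obs)
    mats.append(np.cov(X, rowvar=False))

le_mean = log_euclidean_mean(mats)
arith_mean = np.mean(mats, axis=0)
```

Unlike an arbitrary elementwise average, the log-Euclidean mean is guaranteed to remain symmetric positive definite, and weighting the per-subject log matrices is one natural way to downweight noisier subjects.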