Title: Variable Selection and Forecasting in High Dimensional Linear Regressions with Structural Breaks
Abstract: This paper is concerned with the problem of variable selection and forecasting in the presence of parameter instability. A number of approaches have been proposed for forecasting in the presence of breaks, including the use of rolling windows or exponential down-weighting. However, these studies start with a given model specification and do not consider the problem of variable selection. It is clear that, in the absence of breaks, researchers should weigh the observations equally at both the variable selection and forecasting stages. In this study, we investigate whether or not we should use weighted observations at the variable selection stage in the presence of structural breaks, particularly when the number of potential covariates is large. Amongst the extant variable selection approaches we focus on the recently developed One Covariate at a time Multiple Testing (OCMT) method, which allows a natural distinction between the selection and forecasting stages, and provide theoretical justification for using the full (not down-weighted) sample in the selection stage of OCMT and down-weighting of observations only at the forecasting stage (if needed). The benefits of the proposed method are illustrated by empirical applications to forecasting output growth and stock market returns.
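As a rough illustration of exponential down-weighting at the forecasting stage only, the sketch below fits a weighted least squares line with weights lam**(T-1-t) and forecasts at a new covariate value. The function names, the single-regressor setup, and the parameter lam are assumptions made for this example; the OCMT selection step is not shown, and this is not the paper's implementation.

```python
def exp_weights(T, lam):
    """Weights lam**(T-1-t) for t = 0..T-1; the most recent observation gets weight 1."""
    return [lam ** (T - 1 - t) for t in range(T)]


def wls_forecast(x, y, x_new, lam=1.0):
    """Forecast y at x_new from a weighted least squares fit of y on (1, x).

    lam = 1.0 weighs all observations equally; lam < 1 down-weights older ones.
    Assumes x is not constant, so the weighted variance of x is positive.
    """
    w = exp_weights(len(y), lam)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept + slope * x_new
```

With lam = 1.0 the fit reduces to ordinary least squares on the full sample, which is the equal-weighting case the abstract describes for data without breaks.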
Wednesday, January 27 at 12:30 PM
Title: Testing goodness-of-fit and conditional independence with approximate co-sufficient sampling
Abstract: Goodness-of-fit (GoF) testing is ubiquitous in statistics, with direct ties to model selection, confidence interval construction, conditional independence testing, and multiple testing, just to name a few applications. While testing the GoF of a simple (point) null hypothesis provides an analyst great flexibility in the choice of test statistic while still ensuring validity, most GoF tests for composite null hypotheses are far more constrained, as the test statistic must have a tractable distribution over the entire null model space. A notable exception is co-sufficient sampling (CSS): resampling the data conditional on a sufficient statistic for the null model guarantees valid GoF testing using any test statistic the analyst chooses. But CSS testing requires the null model to have a compact (in an information-theoretic sense) sufficient statistic, which only holds for a very limited class of models; even for a null model as simple as logistic regression, CSS testing is powerless. In this paper, we leverage the concept of approximate sufficiency to generalize CSS testing to essentially any parametric model with an asymptotically-efficient estimator; we call our extension “approximate CSS” (aCSS) testing. We quantify the finite-sample Type I error inflation of aCSS testing and show that it is vanishing under standard maximum likelihood asymptotics, for any choice of test statistic. We apply our proposed procedure both theoretically and in simulation to a number of models of interest to demonstrate its finite-sample Type I error and power. This work is joint with Lucas Janson.
Wednesday, February 3 at 12:30 PM
Title: Functional Linear Regression with Mixed Predictors
Abstract: We study a functional linear regression model that deals with functional responses and allows for both functional covariates and high-dimensional vector covariates. The proposed model is flexible and nests several functional regression models in the literature as special cases. Based on the theory of reproducing kernel Hilbert spaces (RKHS), we propose a penalized least squares estimator that can accommodate functional variables observed on discrete grids. Besides the conventional smoothness penalties, a group Lasso-type penalty is further imposed to induce sparsity in the high-dimensional vector predictors. We derive finite sample theoretical guarantees and show that the excess prediction risk of our estimator is minimax optimal. Furthermore, our analysis reveals an interesting phase transition phenomenon that the optimal excess risk is determined jointly by the smoothness and the sparsity of the functional regression coefficients. A novel efficient optimization algorithm based on iterative coordinate descent is devised to handle the smoothness and sparsity penalties simultaneously. Simulation studies and real data applications illustrate the promising performance of the proposed approach compared to the state-of-the-art methods in the literature.
Wednesday, February 10 at 12:30 PM
Title: A New Robust and Powerful Weighted Logrank Test
Abstract: In weighted logrank tests such as the Fleming-Harrington test and the Tarone-Ware test, weights are used to place more emphasis on early, middle or late events. The purpose is to maximize the power of the test. The optimal weight under an alternative depends on the true hazard functions of the groups being compared, and thus cannot be applied directly. We propose replacing the true hazard functions with their estimates and then using the estimated weights in a weighted logrank test. However, the resulting test does not control type I error correctly because the weights converge to 0 under the null in large samples. We then adjust the estimated optimal weights for correct type I error control while the resulting test still achieves improved power compared to existing weighted logrank tests, and it is shown to be robust in various scenarios. Extensive simulation is carried out to assess the proposed method and it is applied in several clinical studies in lung cancer.
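For readers unfamiliar with weighted logrank tests, here is a minimal sketch of the classical two-sample statistic with Fleming-Harrington weights S(t-)^rho * (1 - S(t-))^gamma, where S is the pooled Kaplan-Meier estimate. It shows only the existing fixed-weight test mentioned in the abstract, not the estimated or adjusted weights the talk proposes; the function name and argument layout are assumptions made for illustration.

```python
import math


def fh_weighted_logrank(times, events, groups, rho=0.0, gamma=0.0):
    """Two-sample weighted logrank Z statistic with Fleming-Harrington weights.

    times: follow-up times; events: 1 = event, 0 = censored; groups: 0 or 1.
    rho = gamma = 0 gives the ordinary (unweighted) logrank statistic.
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current event time
    u = 0.0     # weighted sum of observed minus expected events in group 1
    var = 0.0   # hypergeometric variance of u
    for t in event_times:
        n = sum(1 for s, _, _ in data if s >= t)               # at risk, pooled
        n1 = sum(1 for s, _, g in data if s >= t and g == 1)   # at risk, group 1
        d = sum(1 for s, e, _ in data if s == t and e == 1)    # events, pooled
        d1 = sum(1 for s, e, g in data if s == t and e == 1 and g == 1)
        w = surv ** rho * (1.0 - surv) ** gamma
        u += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv *= 1.0 - d / n
    return u / math.sqrt(var) if var > 0 else 0.0
```

Larger rho emphasizes early events and larger gamma emphasizes late events; a positive Z indicates more events than expected in group 1.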
Wednesday, February 17 at 12:30 PM
Title: Optimal Ranking Recovery from Pairwise Comparisons
Abstract: Ranking from pairwise comparisons is a central problem in a wide range of learning and social contexts. Researchers in various disciplines have made significant methodological and theoretical contributions to it. However, many fundamental statistical properties remain unclear, especially for the recovery of ranking structure. This talk presents two recent projects towards optimal ranking recovery under the Bradley-Terry-Luce (BTL) model.
In the first project, we study the problem of top-k ranking, that is, optimally identifying the set of top-k players. We derive the minimax rate and show that it can be achieved by the MLE. On the other hand, we show that another popular algorithm, the spectral method, is in general suboptimal. It turns out that the leading constants of the sample complexity are different for the two algorithms.
In the second project, we study the problem of full ranking among all players. The minimax rate exhibits a transition between an exponential rate and a polynomial rate depending on the magnitude of the signal-to-noise ratio of the problem. To the best of our knowledge, this phenomenon is unique to full ranking and has not been seen in any other statistical estimation problem. A divide-and-conquer ranking algorithm is proposed to achieve the minimax rate.
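To make the top-k problem from the first project concrete, the sketch below computes the BTL maximum likelihood estimate with the standard minorization-maximization (Zermelo-type) iteration and returns the k players with the largest estimated skills. The function names, the win-count matrix input, and the fixed iteration count are assumptions made for this example; it is not the speakers' code, and it assumes every player has at least one win and one loss so that the MLE exists.

```python
def btl_mle(wins, iters=500):
    """Estimate BTL skill parameters from a win-count matrix via MM updates.

    wins[i][j] is the number of times player i beat player j.
    Returns skills normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            # Keep the old value if player i was never compared with anyone.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


def top_k(wins, k):
    """Indices of the k players with the largest estimated BTL skills."""
    p = btl_mle(wins)
    return sorted(range(len(p)), key=lambda i: -p[i])[:k]
```

For example, with wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]], calling top_k(wins, 2) selects players 0 and 1.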
Wednesday, February 24 at 12:30 PM
Title: Functional Models for Time Varying Random Objects
Abstract: In recent years, samples of time-varying object data such as time-varying networks that are not in a vector space have been increasingly collected. These data can be viewed as elements of a general metric space that lacks local or global linear structure and therefore common approaches that have been used with great success for the analysis of functional data, such as functional principal component analysis, cannot be applied directly.
In this talk, I will present some recent advances along this direction. First, I will discuss ways to obtain dominant modes of variation in time-varying object data. I will describe metric covariance, a novel association measure for paired object data lying in a metric space (\Omega, d) that we use to define a metric auto-covariance function for a sample of random \Omega-valued curves, where \Omega will not have a vector space or manifold structure. The proposed metric auto-covariance function is non-negative definite when the squared metric d^2 is of negative type. Then the eigenfunctions of the linear operator with the auto-covariance function as kernel can be used as building blocks for an object functional principal component analysis for \Omega-valued functional data, including time-varying probability distributions, covariance matrices and time-dynamic networks. Then I will describe how to obtain analogues of functional principal components for time-varying objects by applying Fréchet means and projections of distance functions of the random object trajectories in the directions of the eigenfunctions, leading to real-valued Fréchet scores and object-valued Fréchet integrals. This talk is based on joint work with Hans-Georg Müller.
Wednesday, March 3 at 12:30 PM
Title: Time-varying models and applications
Abstract: In this talk I will discuss several time-varying models and their applications. A major motivation for such models comes from the field of econometrics, but they are also prevalent in several other areas such as medical sciences and climatology. Whenever a time-series dataset is observed over a long period of time, it is natural to assume that the coefficient parameters also vary over time.
We start with some inferential results and analysis using the time-varying analogue of the popular ARCH/GARCH model in a frequentist set-up. A criticism of kernel-based estimation is that it needs a huge sample size for reasonable coverage. However, it is important to note that very little has been done so far in the corresponding Bayesian regime with non-Gaussian dependent data. One of the key reasons for this gap is the challenge of establishing a suitable posterior contraction rate when independence is taken away.
On the Bayesian front, I will talk about two recent works where we deal with Poisson ARX type models and ARCH/GARCH type time-varying models. Our estimation is B-spline based, and we establish optimal contraction of the posterior computed via Hamiltonian Monte Carlo. We conclude the talk by discussing two applications: (a) the Covid-19 spread in NYC through the tvPoisson model and (b) a predictive performance comparison of the Bayesian and frequentist tv(G)ARCH models applied to some real datasets.
Wednesday, March 17 at 12:30 PM
Kamel Lahouel, Postdoctoral Research Fellow in Biostatistics at Johns Hopkins University
Title: Revisiting the tumorigenesis timeline with a data-driven generative model
Abstract: Cancer is driven by the sequential accumulation of genetic and epigenetic changes in oncogenes and tumor suppressor genes. The timing of these events is not well understood. Moreover, it is currently unknown why the same driver gene change appears as an early event in some cancer types and as a later event, or not at all, in others. These questions have become even more topical with the recent progress brought by genome-wide sequencing studies of cancer. Focusing on mutational events, we provide a mathematical model of the full process of tumor evolution that includes different types of fitness advantages for driver genes and carrying-capacity considerations. The model is able to recapitulate a substantial proportion of the observed cancer incidence in several cancer types (colorectal, pancreatic, and leukemia) and inherited conditions (Lynch and familial adenomatous polyposis), by changing only 2 tissue-specific parameters: the number of stem cells in a tissue and its cell division frequency. The model sheds light on the evolutionary dynamics of cancer by suggesting a generalized early onset of tumorigenesis followed by slow mutational waves, in contrast to previous conclusions. Formulas and estimates are provided for the fitness increases induced by driver mutations, often much larger than previously described, and highly tissue dependent. Our results suggest a mechanistic explanation for why the selective fitness advantage introduced by specific driver genes is tissue dependent.
Wednesday, March 31 at 12:30 PM
Title: Standardized partial sums and products of p-values
Abstract: In meta-analysis, a wide range of methods for combining multiple p-values have been applied throughout the scientific literature. For sparse signals where only a small proportion of the p-values are truly significant, a technique called higher criticism has been shown to have asymptotic consistency and more power than Fisher’s original method. However, higher criticism and other related methods can still lack power. Three simple-to-compute statistics are proposed here for detecting sparse signals, based on standardizing partial sums or products of p-value order statistics. The use of standardization is theoretically justified with results demonstrating asymptotic normality, and avoids the computational difficulties encountered when working with analytic forms of the distributions of the partial sums and products. In particular, the standardized partial product demonstrates more power than existing methods for both the standard Gaussian mixture model.
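As background for the baseline the abstract compares against, here is a minimal sketch of Fisher's original method for combining independent p-values: under the global null, -2 * sum(log p_i) follows a chi-square distribution with 2n degrees of freedom, whose survival function has a closed form for even degrees of freedom. The function name is an assumption made for this example, and the standardized partial sums and products proposed in the talk are not implemented here.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Uses the closed-form chi-square survival function for 2n degrees of freedom:
    P(X > x) = exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!.
    Assumes every p-value lies in (0, 1].
    """
    n = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals x / 2
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, n):
        term *= half_stat / k
        total += term
    return math.exp(-half_stat) * total
```

Because Fisher's statistic uses all n p-values with equal weight, a few very small p-values can be diluted by many null ones, which is the sparse-signal setting the abstract targets.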
Wednesday, April 7 at 12:30 PM
Title: Vintage Factor Analysis with Varimax Performs Statistical Inference
Abstract: Psychologists developed Multiple Factor Analysis to decompose multivariate data into a small number of interpretable factors without any a priori knowledge about those factors. In this form of factor analysis, the Varimax "factor rotation" is a key step to make the factors interpretable. Charles Spearman and many others objected to factor rotations because the factors seem to be rotationally invariant. This is a historical enigma because factor rotations have survived and are widely popular because, empirically, they often make the factors easier to interpret. We argue that the rotation makes the factors easier to interpret because, in fact, the Varimax factor rotation performs statistical inference. We show that Principal Components Analysis (PCA) with the Varimax rotation provides a unified spectral estimation strategy for a broad class of modern factor models, including the Stochastic Blockmodel and a natural variation of Latent Dirichlet Allocation (i.e., "topic modeling"). In addition, we show that Thurstone's widely employed sparsity diagnostics implicitly assess a key "leptokurtic" condition that makes the rotation statistically identifiable in these models. Taken together, this shows that the know-how of Vintage Factor Analysis performs statistical inference, reversing nearly a century of statistical thinking on the topic. With a sparse eigensolver, PCA with Varimax is both fast and stable. Combined with Thurstone's straightforward diagnostics, this vintage approach is suitable for a wide array of modern applications.
Wednesday, April 14 at 12:30 PM
Title: Quantum Computation and Statistics
Abstract: Quantum computation and quantum information are of great current interest in fields such as computer science, physics, engineering, chemistry and mathematical sciences. They will likely lead to a new wave of technological innovations in communication, computation and cryptography. As the theory of quantum physics is fundamentally stochastic, randomness and uncertainty are deeply rooted in quantum computation and quantum information. Thus, statistics can play an important role in quantum computation, which in turn may offer great potential to revolutionize statistical computing and data science. This talk will first give a brief introduction to quantum computation and then present statistical work in quantum computing and related research.
Wednesday, April 21 at 12:30 PM
Title: Fast and flexible estimation of effective migration surfaces
Abstract: An important feature in spatial population genetic data is often “isolation-by-distance,” where genetic differentiation tends to increase as individuals become more geographically distant. Recently, Petkova et al. (2016) developed a statistical method called Estimating Effective Migration Surfaces (EEMS) for visualizing spatially heterogeneous isolation-by-distance on a geographic map. While EEMS is a powerful tool for depicting spatial population structure, it can suffer from slow runtimes. Here we develop a related method called Fast Estimation of Effective Migration Surfaces (FEEMS). FEEMS uses a Gaussian Markov Random Field in a penalized likelihood framework that allows for efficient optimization and output of effective migration surfaces. Further, the efficient optimization facilitates the inference of migration parameters per edge in the graph, rather than per node (as in EEMS). When tested with coalescent simulations, FEEMS accurately recovers effective migration surfaces with complex gene-flow histories, including those with anisotropy. Applications of FEEMS to population genetic data from North American gray wolves show it to perform comparably to EEMS, but with solutions obtained orders of magnitude faster. Overall, FEEMS expands the ability of users to quickly visualize and interpret spatial structure in their data.